09/991,212 



Docket No.: PF-0221-3DIV 

(1) REAL PARTY IN INTEREST 
The above-identified application is assigned of record to Incyte Pharmaceuticals, Inc. (now 
Incyte Corporation), (Reel 9779, Frame 0302) which is the real party in interest herein. 

(2) RELATED APPEALS AND INTERFERENCES 
Appellants, their legal representative and the assignee are not aware of any related appeals or 
interferences which will directly affect or be directly affected by or have a bearing on the Board's 
decision in the instant appeal. 



(3) STATUS OF THE CLAIMS 
Claims rejected: Claims 3-7, 9, 10, 12, 13, 46, 48, 57, and 58 
Claims allowed: none 
Claims canceled: Claims 1, 2, 8, 11, 17-27, 30-45, and 49-56 
Claims withdrawn: Claims 14-16, 28, 29, 47, and 59 
Claims on Appeal: Claims 3-7, 9, 10, 12, 13, 46, 48, 57, and 58 (A copy of the claims on 
appeal, as amended, can be found in the attached Appendix.) 



(4) STATUS OF AMENDMENTS AFTER FINAL 
The Amendment after Final Rejection under 37 C.F.R. §1.116 filed April 7, 2003, has been 
entered for purposes of this appeal. See the Advisory Action mailed May 5, 2003, indicating that the 
amendments would be entered upon filing of an appeal. 



(5) SUMMARY OF THE INVENTION 
Appellants' invention is directed to polynucleotides, comprising the polynucleotide sequence of 
SEQ ID NO:2, encoding phosphate transporter NAPTR, comprising the amino acid sequence of SEQ 
ID NO:1 (Specification, e.g., at page 2, line 29 to page 3, line 4; page 10, lines 19-22; page 11, lines 
9-13; and Figures 1A, 1B, and 1C). Appellants' invention also includes polynucleotides encoding a 
naturally occurring amino acid sequence at least 90% identical to SEQ ID NO:1 (e.g., at page 11, lines 
5-7), polynucleotides encoding a fragment of SEQ ID NO:1 which transports phosphate (e.g., at page 
8, lines 21-25), polynucleotides comprising a naturally occurring polynucleotide sequence at least 90% 
or 95% identical to SEQ ID NO:2 (e.g., at page 11, lines 5-11), and polynucleotides comprising at 
least 20 or 60 contiguous nucleotides of a polynucleotide consisting of nucleotides 1183 through 1454 
of SEQ ID NO:2 (e.g., at page 20, lines 10-13; and page 38, lines 11-12 and 24-25). The invention 
further includes microarrays comprising the foregoing polynucleotides (e.g., at page 20, lines 2-5), 
recombinant polynucleotides comprising the foregoing polynucleotides (e.g., at page 15, lines 29-32), 
host cells comprising the foregoing polynucleotides (e.g., at page 16, lines 7-13), and methods of 
making polypeptides encoded by the foregoing polynucleotides (e.g., at page 21, lines 3-24). 

NAPTR, encoded by polynucleotides of the invention, is 401 amino acids in length 
(Specification, e.g., at Sequence Listing; and Figures 1A, 1B, and 1C), and has strong chemical and 
structural homology to human renal sodium phosphate transport protein (GenBank ID 450532; SEQ 
ID NO:3) and rat brain-specific sodium-dependent inorganic phosphate cotransporter (GenBank ID 
507415; SEQ ID NO:4) (e.g., at page 10, lines 29-32). In particular, NAPTR shares 48% identity 
with human renal sodium phosphate transport protein and 29% identity with rat brain-specific sodium- 
dependent inorganic phosphate cotransporter (e.g., at page 10, line 32 to page 11, line 1; and Figures 
2A and 2B). NAPTR, human renal sodium phosphate transport protein, and rat brain-specific sodium- 
dependent inorganic phosphate cotransporter each have a potential N-glycosylation site located at 
amino acid residues N49, N49, and N92 of these polypeptides (e.g., at page 10, line 32 to page 11, 
line 2). Furthermore, these three polypeptides "have rather similar hydrophobicity plots" (e.g., at page 
11, lines 2-4; and Figures 3A, 3B, and 3C). Polynucleotides encoding NAPTR were first identified in 
a "brain tumor cDNA library" (e.g., at page 10, lines 23-25), and NAPTR "appears to play a role in 
the regulation of phosphate levels" (e.g., at page 22, lines 2-4). 

The polynucleotides of the present invention are useful, for example, for toxicology testing, drug 
discovery, and disease diagnosis (Specification, e.g., at page 20, lines 2-10; page 30, line 32 to page 
31, line 5; and page 32, lines 9-30). 

(6) ISSUES 

1. Whether claims 3-7, 9, 10, 12, 13, 46, 48, 57, and 58 meet the utility requirement of 35 
U.S.C. § 101. 

2. Whether claims 3, 6, 7, 9, 12, 13, 46, 48, 57, and 58 meet the written description 
requirement of 35 U.S.C. § 112, first paragraph. 

3. Whether claims 3, 6, 7, 9, 12, 13, 46, 48, 57, and 58 meet the enablement requirement of 
35 U.S.C. § 112, first paragraph. 

4. Whether claims 3-7, 9, 10, 12, and 57 are unpatentable over claims 1-8 of U.S. Patent No. 
5,985,604, based on alleged obviousness-type double patenting. 

(7) GROUPING OF THE CLAIMS 

As to Issue 1 

All of the claims on appeal are grouped together. 

As to Issue 2 

Claims 3, 6, 7, 9, 12, 13, 46, 48, 57, and 58 are grouped together. 

As to Issue 3 

Claims 3, 6, 7, 9, 12, 13, 46, 48, 57, and 58 are grouped together. 

As to Issue 4 

Claims 3-7, 9, 10, 12, and 57 are grouped together. 

(8) APPELLANTS' ARGUMENTS 

Issue 1 - Whether the claims on appeal meet the utility requirement of 35 U.S.C. § 101 

Claims 3-7, 9, 10, 12, 13, 46, 48, 57, and 58 stand rejected under 35 U.S.C. § 101 based on 
the allegation that the claimed invention is not supported by either a specific and substantial asserted 
utility or a well-established utility (Office Action, September 20, 2002; page 3, § 4). These rejections 
allege in particular that: 

"the instant specification discloses that the claimed polynucleotide encodes a 
polypeptide that is structurally related to other sodium phosphate transport proteins and 
predicts that the claimed polynucleotide is involved in disorders associated with 
phosphate transport. However, there is no indication that the claimed polynucleotide is 
differentially expressed or is expressed in an altered form in diseased tissues relative to 
normal tissues. The specification does not disclose any evidence indicating altered 
forms or expression levels of the claimed polynucleotide in diseased tissue. Also, no 
evidence has been presented to verify the claimed polynucleotide encodes a 
polypeptide having sodium phosphate transport activity." (Office Action, March 21, 
2003; page 5) 

and 

"[i]f any polynucleotide expressed in a human has utility in toxicology testing, then that 
polynucleotide has no specific utility as all polynucleotides would have such use. 
Therefore, any human polynucleotide could be used as a control in toxicology testing 
and thus this use would not be a specific utility. If a specific disease state were 
correlated with the presence of altered levels or form of a given polynucleotide, then 
that polynucleotide would have specific utility as an indicator of disease." (Office 
Action, March 21, 2003; page 6; emphasis in original) 

The rejection of claims 3-7, 9, 10, 12, 13, 46, 48, 57, and 58 is improper, as the 
inventions of those claims have a patentable utility as set forth in the instant specification, 
and/or a utility well-known to one of ordinary skill in the art. 

The invention at issue is a polynucleotide sequence corresponding to a gene that is expressed in 
human brain tumor tissue (Specification, e.g., at page 10, lines 23-27). The claimed polynucleotide 
encodes a polypeptide demonstrated in the patent specification to be a member of the phosphate 
transporter family, whose biological functions include regulation of intracellular phosphate levels (e.g., at 
page 1, lines 27-31; page 10, line 29 to page 11, line 1; and Figures 2A and 2B). As such, the claimed 
invention has numerous practical, beneficial uses in toxicology testing, drug development, and the 
diagnosis of disease, none of which require knowledge of how the polypeptide encoded by the claimed 
polynucleotide actually functions. As a result of the benefits of these uses, the claimed invention already 
enjoys significant commercial success. 

Appellants submit with this brief the declaration of Dr. Tod Bedilion (of record; originally 
submitted on December 16, 2002) describing some of the practical uses of the claimed invention in 
gene and protein expression monitoring applications. The Bedilion Declaration demonstrates that the 
positions and arguments made by the Examiner with respect to the utility of the claimed polynucleotide 
are without merit. 

The Bedilion Declaration describes, in particular, how the claimed expressed polynucleotides 
can be used in gene expression monitoring applications that were well-known at the time the patent 
application was filed, and how those applications are useful in developing drugs and monitoring their 
activity. Dr. Bedilion states that the claimed invention is a useful tool when employed as highly specific 
probes in a cDNA microarray: 

Persons skilled in the art would [on February 24, 1997] appreciate that a cDNA 
microarray that contained the SEQ ID NO:1-encoding polynucleotides would be a 
more useful tool than a cDNA microarray that did not contain any of these 
polynucleotides, in connection with conducting gene expression monitoring studies on 
proposed (or actual) drugs for disorders associated with increased or decreased 
phosphate levels for such purposes as evaluating their efficacy and toxicity. (Bedilion 
Declaration, ¶ 15) 

The Patent Examiner does not dispute that the claimed polynucleotides can be used as probes 
in cDNA microarrays and used in gene expression monitoring applications. Instead, the Examiner 
contends that the claimed polynucleotides cannot be useful without precise knowledge of their 
biological functions, or the biological functions of their encoded polypeptide. But the law never has 
required knowledge of biological function to prove utility. It is the claimed invention's uses, not its 
functions, that are the subject of a proper analysis under the utility requirement. 

In any event, as demonstrated by the Bedilion Declaration, the person of ordinary skill in the art 
can achieve beneficial results from the claimed polynucleotides in the absence of any knowledge as to 
the precise function of the protein encoded by them. The uses of the claimed polynucleotides in gene 
expression monitoring applications are in fact independent of their precise biological functions. 

The Examiner contends that the asserted utility of the claimed polynucleotides and arrays in 
toxicology testing is not specific because "any polynucleotide can be used in a microarray, just as any 
polynucleotide can be used for expression of an encoded protein or as a hybridization probe" (Office 
Action, March 21, 2003; page 4; emphasis in original). In addition, the Examiner asserts that "[s]ince 
the specification does not disclose convincing evidence of the function of the polypeptide of SEQ ID 
NO:2 or a correlation between any particular disease or disorder and an altered level or form of the 
claimed polynucleotide, the results of gene expression monitoring assays using a cDNA microarray 
comprising the claimed polynucleotide would be meaningless without further research" (Office Action, 
March 21, 2003; page 8; emphasis in original). This is incorrect. While it is true that all 
polynucleotides expressed in humans have utility in toxicology testing based on the property of being 
expressed at some time in development or in the cell life cycle, this basis for utility does not preclude 
that utility from being specific, substantial, and credible. A toxicology test using any particular 
expressed polynucleotide is dependent on the identity of that polynucleotide, not on its biological 
function or its disease association. The results obtained from using any particular human-expressed 
polynucleotide in toxicology testing are specific to both the compound being tested and the 
polynucleotide used in the test. No two human-expressed polynucleotides are interchangeable for 
toxicology testing because the effects on the expression of any two such polynucleotides will differ 
depending on the identity of the compound tested and the identities of the two polynucleotides. It is 
not necessary to know the biological functions and disease associations of the polynucleotides in order 
to carry out such toxicology tests. Therefore, a disclosure of "convincing evidence of the function of the 
polypeptide of SEQ ID NO:2 or a correlation between any particular disease or disorder and an 
altered level or form of the claimed polynucleotide" is not required for the claimed polynucleotides to 
have a specific and substantial utility in toxicology testing. At the very least, the claimed polynucleotides 
are specific controls for toxicology tests in developing drugs targeted to other polynucleotides, and are 
clearly useful as such. 

I. The Applicable Legal Standard 

To meet the utility requirement of sections 101 and 112 of the Patent Act, the patent applicant 
need only show that the claimed invention is "practically useful," Anderson v. Natta, 480 F.2d 1392, 
1397, 178 USPQ 458 (CCPA 1973) and confers a "specific benefit" on the public. Brenner v. 
Manson, 383 U.S. 519, 534-35, 148 USPQ 689 (1966). As discussed in a recent Court of Appeals 
for the Federal Circuit case, this threshold is not high: 

An invention is "useful" under section 101 if it is capable of providing some identifiable 
benefit. See Brenner v. Manson, 383 U.S. 519, 534 [148 USPQ 689] (1966); 
Brooktree Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571 [24 
USPQ2d 1401] (Fed. Cir. 1992) ("to violate Section 101 the claimed device must be 
totally incapable of achieving a useful result"); Fuller v. Berger, 120 F. 274, 275 (7th 
Cir. 1903) (test for utility is whether invention "is incapable of serving any beneficial 
end"). 

Juicy Whip Inc. v. Orange Bang Inc., 51 USPQ2d 1700 (Fed. Cir. 1999). 

While an asserted utility must be described with specificity, the patent applicant need not 
demonstrate utility to a certainty. In Stiftung v. Renishaw PLC, 945 F.2d 1173, 1180, 20 USPQ2d 
1094 (Fed. Cir. 1991), the United States Court of Appeals for the Federal Circuit explained: 

An invention need not be the best or only way to accomplish a certain result, and it 
need only be useful to some extent and in certain applications: "[T]he fact that an 
invention has only limited utility and is only operable in certain applications is not 
grounds for finding lack of utility." Envirotech Corp. v. Al George, Inc., 730 F.2d 
753, 762, 221 USPQ 473, 480 (Fed. Cir. 1984). 

The specificity requirement is not, therefore, an onerous one. If the asserted utility is described 
so that a person of ordinary skill in the art would understand how to use the claimed invention, it is 
sufficiently specific. See Standard Oil Co. v. Montedison, S.p.a., 212 U.S.P.Q. 327, 343 (3d Cir. 
1981). The specificity requirement is met unless the asserted utility amounts to a "nebulous expression" 
such as "biological activity" or "biological properties" that does not convey meaningful information 
about the utility of what is being claimed. Cross v. Iizuka, 753 F.2d 1040, 1048 (Fed. Cir. 1985). 

In addition to conferring a specific benefit on the public, the benefit must also be "substantial." 
Brenner, 383 U.S. at 534. A "substantial" utility is a practical, "real-world" utility. Nelson v. Bowler, 
626 F.2d 853, 856, 206 USPQ 881 (CCPA 1980). 

If persons of ordinary skill in the art would understand that there is a "well-established" utility 
for the claimed invention, the threshold is met automatically and the applicant need not make any 
showing to demonstrate utility. Manual of Patent Examining Procedure at § 706.03(a). Only if there is 
no "well-established" utility for the claimed invention must the applicant demonstrate the practical 
benefits of the invention. Id. 

Once the patent applicant identifies a specific utility, the claimed invention is presumed to 
possess it. In re Cortright, 165 F.3d 1353, 1357, 49 USPQ2d 1464 (Fed. Cir. 1999); In re Brana, 
51 F.3d 1560, 1566; 34 USPQ2d 1436 (Fed. Cir. 1995). In that case, the Patent Examiner bears the 
burden of demonstrating that a person of ordinary skill in the art would reasonably doubt that the 
asserted utility could be achieved by the claimed invention. Id. To do so, the Patent Examiner must 
provide evidence or sound scientific reasoning. See In re Langer, 503 F.2d 1380, 1391-92, 183 
USPQ 288 (CCPA 1974). If and only if the Patent Examiner makes such a showing, the burden shifts 
to the applicant to provide rebuttal evidence that would convince the person of ordinary skill that there 
is sufficient proof of utility. Brana, 51 F.3d at 1566. The applicant need only prove a "substantial 
likelihood" of utility; certainty is not required. Brenner, 383 U.S. at 532. 

II. Toxicology testing, drug discovery, and disease diagnosis are sufficient utilities under 
35 U.S.C. §§ 101 and 112, first paragraph 

The claimed invention meets all of the necessary requirements for establishing a credible utility 
under the Patent Law: There are "well-established" uses for the claimed invention known to persons of 
ordinary skill in the art, and there are specific practical and beneficial uses for the invention disclosed in 
the patent application's specification. These uses are explained, in detail, in the Bedilion Declaration 
accompanying this brief. Objective evidence, not considered by the Patent Examiner, further 
corroborates the credibility of the asserted utilities. 

A. The use of the claimed polynucleotides for toxicology testing, drug discovery, 
and disease diagnosis are practical uses that confer "specific benefits" to the 
public 

The claimed invention has specific, substantial, real-world utility by virtue of its use in toxicology 
testing, drug development and disease diagnosis through gene expression profiling. These uses are 
explained in detail in the accompanying Bedilion Declaration, the substance of which is not rebutted by 
the Examiner. There is no dispute that the claimed invention is in fact a useful tool in cDNA 
microarrays used to perform gene expression analysis. That is sufficient to establish utility for the 
claimed polynucleotides. 

The instant application is a divisional of, and claims priority to, Lai et al. (U.S. Ser. No. 
09/391,958, filed September 8, 1999; hereinafter "the Lai '958 application"), which is a divisional of, 
and claims priority to, Lai et al. (U.S. Ser. No. 08/805,118, filed February 24, 1997; hereinafter "the 
Lai '118 application"). The instant application and the Lai '958 and Lai '118 applications were filed 
with essentially identical specifications, with the exception of corrected typographical errors and 
reformatting. Thus page and line numbers may not match as between the instant application and the Lai 
'958 and Lai '118 applications. 

In his Declaration, Dr. Bedilion explains the many reasons why a person skilled in the art 
reading the Lai '118 application on February 24, 1997 would have understood that application to 
disclose the claimed polynucleotides to be useful for a number of gene expression monitoring 
applications, e.g., as highly specific probes for the expression of those specific polynucleotides in 
connection with the development of drugs and the monitoring of the activity of such drugs (Bedilion 
Declaration at, e.g., ¶¶ 10-15). Much, but not all, of Dr. Bedilion's explanation concerns the use of the 
claimed polynucleotides in cDNA microarrays of the type first developed at Stanford University for 
evaluating the efficacy and toxicity of drugs, as well as for other applications (Bedilion Declaration at, 
e.g., ¶¶ [illegible] and 15).1 

1 Dr. Bedilion also explained, for example, why persons skilled in the art would also appreciate, 
based on the Lai '118 specification, that the claimed polynucleotides would be useful in connection with 
developing new drugs using technology, such as Northern analysis, that predated by many years the 
development of the cDNA technology (Bedilion Declaration, ¶ 16). 

In connection with his explanations, Dr. Bedilion states that "the specification of the Lai '118 
application would have led a person skilled in the art in February 1997, who was using gene expression 
monitoring in connection with developing new drugs for the treatment of disorders associated with 
increased or decreased phosphate levels, to conclude that a cDNA microarray that contained the SEQ 
ID NO:1-encoding polynucleotides would be a highly useful tool and to request specifically that any 
cDNA microarray that was being used for such purposes contain the SEQ ID NO:1-encoding 
polynucleotides" (Bedilion Declaration, ¶ 15). For example, as explained by Dr. Bedilion, "[p]ersons 
skilled in the art would [on February 24, 1997] appreciate that a cDNA microarray that contained the 
SEQ ID NO:1-encoding polynucleotides would be a more useful tool than a cDNA microarray that did 
not contain any of these polynucleotides, in connection with conducting gene expression monitoring 
studies on proposed (or actual) drugs for disorders associated with increased or decreased phosphate 
levels for such purposes as evaluating their efficacy and toxicity." Id. 

In support of those statements, Dr. Bedilion provided detailed explanations of how cDNA 
technology can be used to conduct gene expression monitoring evaluations, with extensive citations to 
pre- and post-February 24, 1997 publications showing the state of the art on February 24, 1997 
(Bedilion Declaration, ¶¶ 10-14). While Dr. Bedilion's explanations in paragraph 15 of his Declaration 
include almost three and a half pages of text and six subparts (a)-(f), he specifically states that his 
explanations are not "all-inclusive." Id. For example, with respect to toxicity evaluations, Dr. Bedilion 
had earlier explained how persons skilled in the art who were working on drug development on 
February 24, 1997 (and for several years prior to February 24, 1997) "without any doubt" appreciated 
that the toxicity (or lack of toxicity) of any proposed drug was "one of the most important criteria to be 
considered and evaluated in connection with the development of the drug" and how the teachings of the 
Lai '118 application clearly include using differential gene expression analyses in toxicity studies 
(Bedilion Declaration, ¶ 10). 

Thus, the Bedilion Declaration establishes that persons skilled in the art reading the Lai '118 
application at the time it was filed "would have wanted their cDNA microarray to have a probe to a 
SEQ ID NO:1-encoding polynucleotide because a microarray that contained such a probe (as 
compared to one that did not) would provide more useful results in the kind of gene expression 
monitoring studies using cDNA microarrays that persons skilled in the art have been doing since well 
prior to February 24, 1997" (Bedilion Declaration, ¶ 15, item (f)). This, by itself, provides more than 
sufficient reason to compel the conclusion that the Lai '118 application disclosed to persons skilled in 
the art at the time of its filing substantial, specific, and credible real-world utilities for the claimed 
polynucleotides. 

Nowhere does the Patent Examiner address the fact that, as described, for example, on pages 
20-21 and 31 of the Lai '118 application, the claimed polynucleotides can be used as highly specific 
probes in, for example, cDNA microarrays - probes that without question can be used to measure both 
the existence and amount of complementary RNA sequences known to be the expression products of 
the claimed polynucleotides. The claimed invention is not, in that regard, some random sequence 
whose value as a probe is speculative or would require further research to determine. 

Given the fact that the claimed SEQ ID NO:2 polynucleotide is known to be expressed, its 
utility as a measuring and analyzing instrument for expression levels is as indisputable as a scale's utility 
for measuring weight. This use as a measuring tool, regardless of how the expression level data 
ultimately would be used by a person of ordinary skill in the art, by itself demonstrates that the claimed 
invention provides an identifiable, real-world benefit that meets the utility requirement. Raytheon v. 
Roper, 724 F.2d 951, (Fed. Cir. 1983) (claimed invention need only meet one of its stated objectives 
to be useful); In re Cortright, 165 F.3d 1353, 1359 (Fed. Cir. 1999) (how the invention works is 
irrelevant to utility); M.P.E.P. § 2107.01 ("Many research tools such as gas chromatographs, screening 
assays, and nucleotide sequencing techniques have a clear, specific, and unquestionable utility (e.g., 
they are useful in analyzing compounds)" (emphasis added)). 

Though Appellants need not so prove to demonstrate utility, there can be no reasonable dispute 
that persons of ordinary skill in the art have numerous uses for information about relative gene 
expression including, for example, understanding the effects of a potential drug for treating disorders 
associated with increased or decreased phosphate levels. Because the patent application states 
explicitly that the claimed polynucleotide is known to be expressed in brain tumor cells (see the Lai 
'118 application at, e.g., page 11, lines 20-22; and page 38, lines 25-30), and expresses a protein that 
is a member of a class known to regulate intracellular phosphate levels, there can be no reasonable 
dispute that a person of ordinary skill in the art could put the claimed invention to such use. In other 
words, the person of ordinary skill in the art can derive more information about a potential drug 
candidate for disorders associated with increased or decreased phosphate levels, or potential toxin, 
with the claimed invention than without it (see Bedilion Declaration at, e.g., ¶ 15, subparts (e)-(f)). 

The Bedilion Declaration shows that a number of pre-February 24, 1997 publications confirm 
and further establish the utility of cDNA microarrays in a wide range of drug development gene 
expression monitoring applications at the time the Lai '118 application was filed (Bedilion Declaration 
¶¶ 10-14; and Tabs A-G). Indeed, Brown and Shalon U.S. Patent No. 5,807,522 (the Brown '522 
patent, Bedilion Declaration at Tab D), which issued from a patent application filed in June 1995 and 
was effectively published on December 29, 1995 as a result of the publication of a PCT counterpart 
application, shows that the Patent Office recognizes the patentable utility of the cDNA technology 
developed in the early to mid-1990s. As explained by Dr. Bedilion, among other things (Bedilion 
Declaration, ¶ 12): 

The Brown '522 patent further teaches that the "[m]icroarrays of immobilized nucleic 
acid sequences prepared in accordance with the invention" can be used in "numerous" 
genetic applications, including "monitoring of gene expression" applications (see Tab D 
at col. 14, lines 36-42). The Brown '522 patent teaches (a) monitoring gene 
expression (i) in different tissue types, (ii) in different disease states, and (iii) in response 
to different drugs, and (b) that arrays disclosed therein may be used in toxicology 
studies (see Tab D at col. 15, lines 13-18 and 52-58; and col. 18, lines 25-30). 

Literature reviews published shortly after the filing of the Lai '118 application describing the 
state of the art further confirm the claimed invention's utility. Rockett et al. confirm, for example, that 
the claimed invention is useful for differential expression analysis regardless of how expression is 
regulated: 

Despite the development of multiple technological advances which have recently 
brought the field of gene expression profiling to the forefront of molecular analysis, 
recognition of the importance of differential gene expression and characterization of 
differentially expressed genes has existed for many years. 

* * * 

Although differential expression technologies are applicable to a broad range of models, 
perhaps their most important advantage is that, in most cases, absolutely no prior 
knowledge of the specific genes which are up- or down-regulated is required. 

* * * 

Whereas it would be informative to know the identity and functionality of all genes 
up/down regulated by . . . toxicants, this would appear a longer term goal .... 
However, the current use of gene profiling yields a pattern of gene changes for a 
xenobiotic of unknown toxicity which may be matched to that of well characterized 
toxins, thus alerting the toxicologist to possible in vivo similarities between the unknown 
and the standard, thereby providing a platform for more extensive toxicological 
examination. [emphasis added] 

Rockett et al., Differential gene expression in drug metabolism and toxicology: Practicalities, problems 
and potential, Xenobiotica, 1999, 29:655-691. 

In another article, Lashkari et al. state explicitly that sequences that are merely "predicted" to 
be expressed (predicted Open Reading Frames, or ORFs) - the claimed invention in fact is known to 
be expressed - have numerous uses: 

Efforts have been directed toward the amplification of each predicted ORF or any 
other region of the genome ranging from a few base pairs to several kilobase pairs. 
There are many uses for these amplicons - they can be cloned into standard vectors or 
specialized expression vectors, or can be cloned into other specialized vectors such as 
those used for two-hybrid analysis. The amplicons can also be used directly by, 
for example, arraying onto glass for expression analysis, for DNA binding 
assays, or for any direct DNA assay. [emphasis added] 

Lashkari et al., Whole genome analysis: Experimental access to all genome sequenced segments 
through larger-scale efficient oligonucleotide synthesis and PCR, Proceedings of the National Academy 
of Sciences USA, 1997, 94:8945-8947. 

The Examiner disputes the utility of the claimed invention as a research tool by asserting that the 
claimed polynucleotides and arrays are analogous to "a scale without an identifiable unit of measure - 
one could place an object on the scale, however, further experimentation would be required to interpret 
the result and determine the weight of the object" (Office Action, March 21, 2003; page 12). With 
respect to the utility of the claimed polynucleotides and arrays in toxicology testing, the Examiner is 
wrong. In toxicology testing as asserted, the claimed polynucleotides are not the object of the research. 
The claimed polynucleotides are a research tool used to assess the toxicity of drug candidates which 
are specifically targeted to other polynucleotides. It is the other polynucleotides and the drug 
candidates which are the object of the research. 

B. The use of nucleic acids coding for proteins expressed by humans as tools for 
toxicology testing, drug discovery, and the diagnosis of disease is now "well- 
established" 

The technologies made possible by expression profiling and the DNA tools upon which they 
rely are now well-established. The technical literature recognizes not only the prevalence of these 
technologies, but also their unprecedented advantages in drug development, testing and safety 
assessment. These technologies include toxicology testing, as described by Dr. Bedilion in his 
declaration. 

Toxicology testing is now standard practice in the pharmaceutical industry. See, e.g., John C.
Rockett et al., supra:

Knowledge of toxin-dependent regulation in target tissues is not solely an academic 
pursuit as much interest has been generated in the pharmaceutical industry to harness 
this technology in the early identification of toxic drug candidates, thereby shortening the 
developmental process and contributing substantially to the safety assessment of new 
drugs. (Rockett et al., page 656) 

To the same effect are several other scientific publications, including Emile F. Nuwaysir et al.,
Microarrays and toxicology: The advent of toxicogenomics, Molecular Carcinogenesis, 1999,
24:153-159; and Sandra Steiner and N. Leigh Anderson, Expression profiling in toxicology - potentials
and limitations, Toxicology Letters, 2000, 112-113:467-471.

Nucleic acids useful for measuring the expression of whole classes of genes are routinely
incorporated for use in toxicology testing. Nuwaysir et al. describes, for example, a Human ToxChip
comprising 2089 human clones, which were selected

for their well-documented involvement in basic cellular processes as well as their 
responses to different types of toxic insult. Included on this list are DNA replication 
and repair genes, apoptosis genes, and genes responsive to PAHs and dioxin-like 
compounds, peroxisome proliferators, estrogenic compounds, and oxidant stress. 
Some of the other categories of genes include transcription factors, oncogenes, tumor 
suppressor genes, cyclins, kinases, phosphatases, cell adhesion and motility genes, and 
homeobox genes. Also included in this group are 84 housekeeping genes, whose 
hybridization intensity is averaged and used for signal normalization of the other genes 
on the chip. 

See also Table 1 of Nuwaysir et al. (listing additional classes of genes deemed to be of special interest 
in making a human toxicology microarray). 
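
The normalization step described in the Nuwaysir passage can be illustrated with a minimal sketch. It assumes the simplest reading of the quotation: the hybridization intensities of the housekeeping genes are averaged, and every other intensity is divided by that average. The function name, gene labels, and intensity values are hypothetical and are not taken from Nuwaysir et al.

```python
from statistics import mean


def normalize_by_housekeeping(intensities, housekeeping_ids):
    """Divide every intensity by the mean intensity of the housekeeping genes.

    `intensities` maps a gene label to a raw hybridization intensity.
    `housekeeping_ids` lists the labels treated as housekeeping controls.
    Both are illustrative inputs, not data from any cited study.
    """
    reference = mean(intensities[g] for g in housekeeping_ids)
    if reference <= 0:
        raise ValueError("housekeeping reference intensity must be positive")
    return {gene: value / reference for gene, value in intensities.items()}


# Hypothetical raw readings: two housekeeping controls and two test genes.
raw = {"HK1": 100.0, "HK2": 300.0, "GENE_A": 400.0, "GENE_B": 50.0}
normalized = normalize_by_housekeeping(raw, ["HK1", "HK2"])
print(normalized["GENE_A"], normalized["GENE_B"])
```

Under this reading, a normalized value above 1 means the gene hybridized more strongly than the average housekeeping control on the same chip, which is what allows readings from different chips to be compared.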

The more genes that are available for use in toxicology testing, the more powerful the technique.
"Arrays are at their most powerful when they contain the entire genome of the species they are being
used to study." John C. Rockett and David J. Dix, Application of DNA arrays to toxicology,
Environmental Health Perspectives, 1999, 107:681-685. Control genes are carefully selected for their 
stability across a large set of array experiments in order to best study the effect of toxicological 
compounds. See attached email from the primary investigator of the Nuwaysir paper, Dr. Cynthia 
Afshari to an Incyte employee, dated July 3, 2000, as well as the original message to which she was 
responding. Thus, there is no expressed gene which is irrelevant to screening for toxicological effects, 
and all expressed genes have a utility for toxicological screening. 

In fact, the potential benefits to the public, in terms of lives saved and reduced health care costs,
are enormous. Recent developments provide evidence that the benefits of this information are already
beginning to manifest themselves. Examples include the following:

• In 1999, CV Therapeutics, an Incyte collaborator, was able to use Incyte gene 
expression technology, information about the structure of a known transporter 
gene, and chromosomal mapping location, to identify the key gene associated 
with Tangier disease. This discovery took place over a matter of only a few
weeks, due to the power of these new genomics technologies. The discovery 
received an award from the American Heart Association as one of the top 10 
discoveries associated with heart disease research in 1999.

• In an April 9, 2000, article published by the Bloomberg news service, an Incyte
customer stated that it had reduced the time associated with target discovery
and validation from 36 months to 18 months, through use of Incyte's genomic
information database. Other Incyte customers have privately reported similar 
experiences. The implications of this significant saving of time and expense for 
the number of drugs that may be developed and their cost are obvious. 

• In a February 10, 2000, article in the Wall Street Journal, one Incyte
customer stated that over 50 percent of the drug targets in its current pipeline 
were derived from the Incyte database. Other Incyte customers have privately 
reported similar experiences. By doubling the number of targets available to 
pharmaceutical researchers, Incyte genomic information has demonstrably 
accelerated the development of new drugs. 

Because the Patent Examiner failed to address or consider the "well-established" utilities for the
claimed invention in toxicology testing, drug development, and the diagnosis of disease, the rejections 
should be overturned regardless of their merit. 

C. The similarity of the polypeptide encoded by the claimed invention to another 
polypeptide of undisputed utility demonstrates utility 

In addition to having substantial, specific and credible utilities in numerous gene expression 
monitoring applications, the utility of the claimed polynucleotides can be imputed based on the 
relationship between the polypeptide they encode, NAPTR, and another polypeptide of unquestioned 
utility, human renal sodium phosphate transport protein (NPT1). The two polypeptides have sufficient 
similarities in their sequences that a person of ordinary skill in the art would recognize more than a 
reasonable probability that the polypeptide encoded for by the claimed invention has utility similar to 
NPT1. Appellants need not show any more to demonstrate utility. In re Brana, 51 F.3d at 1567. 

It is undisputed that the polypeptide coded for by the claimed polynucleotides shares 48% 
sequence identity over 401 amino acid residues with NPT1 (Specification, e.g., at page 10, line 32 to 
page 11, line 1; and Figures 2A and 2B). In addition, NAPTR, NPT1, and rat brain-specific
sodium-dependent inorganic phosphate cotransporter all share a potential N-glycosylation site (e.g., at page 11,
lines 1-2), and have rather similar hydrophobicity plots (e.g., at page 11, lines 2-4; and Figures 3A, 3B, 
and 3C). This is more than enough homology to demonstrate a reasonable probability that the utility of 
NPT1 can be imputed to the polynucleotides of the claimed invention (through the polypeptide they 
encode). It is well-known that the probability that two unrelated polypeptides share more than 40% 
sequence homology over 70 amino acid residues is exceedingly small. Brenner et al., Proceedings of 
the National Academy of Sciences USA, 1998, 95:6073-6078. Given homology in excess of 40% 
over more than 70 amino acid residues, the probability that the polypeptide coded for by the claimed 
polynucleotides is related to NPT1 is, accordingly, very high. 

The Examiner must accept the Appellants' demonstration that the homology between the 
polypeptide coded for by the claimed invention and NPT1 demonstrates utility by a reasonable 
probability unless the Examiner can demonstrate through evidence or sound scientific reasoning that a 
person of ordinary skill in the art would doubt utility. See In re Langer, 503 F.2d 1380, 1391-92,
183 USPQ 288 (CCPA 1974). The Examiner has not provided sufficient evidence or sound scientific 
reasoning to the contrary. 

While the Examiner has cited literature identifying some of the difficulties that may be involved in 
predicting protein function, none suggests that functional homology cannot be inferred by a reasonable 
probability in this case. See van de Loo et al., Proc. Natl. Acad. Sci. USA, 1995, 92:6743-6747;
Seffernick et al., J. Bacteriol., 2001, 183:2405-2410; Broun et al., Science, 1998, 282:1315-1317;
Bork, Genome Res., 2000, 10:398-400; Scott et al., Nat. Genet., 1999, 21:440-443; Vrljic et al., J.
Mol. Microbiol. Biotechnol., 1999, 1:327-336; Tenenhouse et al., Am. J. Physiol., 1998,
275:F527-F534; Murzin et al., J. Mol. Biol., 1995, 247:536-540; and Brenner et al., Trends Genet., 1999,
15:132-133. Importantly, none contradicts Brenner's basic rule that sequence homology in excess of 40%
over 70 or more amino acid residues yields a high probability of functional homology as well. Brenner 
et al., Proceedings of the National Academy of Sciences USA, 1998, 95:6073-6078. More 
importantly, none contradicts Bork's findings in the Bork reference, cited by the Examiner, that there is 
a 70% accuracy rate for bioinformatics-based predictions in general, and a 90% accuracy rate for the 
prediction of functional features by homology. Bork, supra. At most, these articles individually and 
together stand for the proposition that it is difficult to make predictions about function with certainty. 
The standard applicable in this case is not, however, proof to certainty, but rather proof to reasonable 
probability. 

D. Objective evidence corroborates the utilities of the claimed invention 

There is, in fact, no restriction on the kinds of evidence a Patent Examiner may consider in 
determining whether a "real-world" utility exists. "Real-world" evidence, such as evidence showing
actual use or commercial success of the invention, can demonstrate conclusive proof of utility.
Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); Nestle v. Eugene, 55 F.2d 854, 856, 12
USPQ 335 (6th Cir. 1932). Indeed, proof that the invention is made, used or sold by any person or 
entity other than the patentee is conclusive proof of utility. United States Steel Corp. v. Phillips 
Petroleum Co., 865 F.2d 1247, 1252, 9 USPQ2d 1461 (Fed. Cir. 1989). 

Over the past several years, a vibrant market has developed for databases containing all
expressed genes (along with the polypeptide translations of those genes), in particular genes having 
medical and pharmaceutical significance such as the instant sequence. (Note that the value in these 
databases is enhanced by their completeness, but each sequence in them is independently valuable.) 
The databases sold by Appellants' assignee, Incyte, include exactly the kinds of information made 
possible by the claimed invention, such as tissue and disease associations. Incyte sells its database 
containing the sequences of the claimed polynucleotides and the encoded polypeptide, and millions of 
other sequences, throughout the scientific community, including to pharmaceutical companies who use 
the information to develop new pharmaceuticals. 

Both Incyte's customers and the scientific community have acknowledged that Incyte's
databases have proven to be valuable in, for example, the identification and development of drug
candidates. As Incyte adds information to its databases, including the information that can be generated
only as a result of Incyte's invention of the claimed polynucleotides, the databases become even more
powerful tools. Thus the claimed invention adds more than incremental benefit to the drug discovery 
and development process. 

Customers can, moreover, purchase the claimed SEQ ID NO:2 polynucleotide directly from 
Incyte, saving the customer the time and expense of isolating and purifying or cloning the polynucleotide 
for research uses such as those described supra. 

III. The Patent Examiner's Rejections Are Without Merit

Rather than responding to the evidence demonstrating utility, the Examiner attempts to dismiss it 
altogether by arguing that the disclosed and well-established utilities for the claimed polynucleotides are 
not "specific" or "substantial" utilities (Office Action, March 21, 2003; page 4). The Examiner is 
incorrect both as a matter of law and as a matter of fact. 

A. The precise biological role or function of an expressed polynucleotide is not
required to demonstrate utility 

The Patent Examiner's primary rejection of the claimed invention is based on the ground that, 
without information as to the precise "biological role" of the claimed invention, the claimed invention's
utility is not sufficiently specific. According to the Examiner, it is not enough that a person of ordinary 
skill in the art could use and, in fact, would want to use the claimed invention either by itself or in a 
cDNA microarray to monitor the expression of genes for such applications as the evaluation of a drug's 
efficacy and toxicity. The Examiner would require, in addition, that the Appellants provide a specific 
and substantial interpretation of the results generated in any given expression analysis. 

It may be that specific and substantial interpretations and detailed information on biological 
function are necessary to satisfy the requirements for publication in some technical journals, but they are 
not necessary to satisfy the requirements for obtaining a United States patent. The relevant question is 
not, as the Examiner would have it, whether it is known how or why the invention works, In re
Cortright, 165 F.3d 1353, 1359 (Fed. Cir. 1999), but rather whether the invention provides an
"identifiable benefit" in presently available form. Juicy Whip Inc. v. Orange Bang Inc., 185 F.3d 
1364, 1366 (Fed. Cir. 1999). If the benefit exists, and there is a substantial likelihood the invention 
provides the benefit, it is useful. There can be no doubt, particularly in view of the Bedilion Declaration 
(at, e.g., ¶¶ 10 and 15), that the present invention meets this test.

The threshold for determining whether an invention produces an identifiable benefit is low. 
Juicy Whip, 185 F.3d at 1366. Only those utilities that are so nebulous that a person of ordinary skill 
in the art would not know how to achieve an identifiable benefit and, at least according to the PTO 
guidelines, so-called "throwaway" utilities that are not directed to a person of ordinary skill in the art at 
all, do not meet the statutory requirement of utility. Utility Examination Guidelines, 66 Fed. Reg. 1092 
(Jan. 5, 2001). 

Knowledge of the biological function or role of a biological molecule has never been required to
show real-world benefit. In its most recent explanation of its own utility guidelines, the PTO
acknowledged as much (66 F.R. at 1095):

[T]he utility of a claimed DNA does not necessarily depend on the function of the 
encoded gene product. A claimed DNA may have specific and substantial utility 
because, e.g., it hybridizes near a disease-associated gene or it has gene-regulating
activity. 

By implicitly requiring knowledge of biological function for any claimed nucleic acid, the 
Examiner has, contrary to law, elevated what is at most an evidentiary factor into an absolute 
requirement of utility. Rather than looking to the biological role or function of the claimed invention, the 
Examiner should have looked first to the benefits it is alleged to provide. 

B. Membership in a class of useful products can be proof of utility 

Despite the evidence that the claimed polynucleotides encode a polypeptide in the phosphate 
transporter family, the Examiner refused to impute the utility of the members of the phosphate 
transporter family to NAPTR. In the Office Action of March 21, 2003, the Examiner takes the position
that, unless Appellants can identify which particular biological function within the class of phosphate 
transporters is possessed by NAPTR, utility cannot be imputed (Office Action, March 21, 2003; pages 
18-21). To demonstrate utility by membership in the class of phosphate transporters, the Examiner 
would require that all phosphate transporters possess a "common" utility. 

There is no such requirement in the law. In order to demonstrate utility by membership in a 
class, the law requires only that the class not contain a substantial number of useless members. So long 
as the class does not contain a substantial number of useless members, there is sufficient likelihood that 
the claimed invention will have utility, and a rejection under 35 U.S.C. § 101 is improper. That is true 
regardless of how the claimed invention ultimately is used and whether the members of the class 
possess one utility or many. See Brenner v. Manson, 383 U.S. 519, 532 (1966); Application of 
Kirk, 376 F.2d 936, 943 (CCPA 1967).

Membership in a "general" class is insufficient to demonstrate utility only if the class contains a 
sufficient number of useless members such that a person of ordinary skill in the art could not impute 
utility by a substantial likelihood. There would be, in that case, a substantial likelihood that the claimed 
invention is one of the useless members of the class. In the few cases in which class membership did 
not prove utility by substantial likelihood, the classes did in fact include predominately useless members. 
E.g., Brenner (man-made steroids); Kirk (same); Natta (man-made polyethylene polymers). 

The Examiner addresses NAPTR as if the general class in which it is included is not the
phosphate transporter family, but rather all polynucleotides or all polypeptides, including the vast 
majority of useless theoretical molecules not occurring in nature, and thus not pre-selected by nature to 
be useful. While these "general classes" may contain a substantial number of useless members, the 
phosphate transporter family does not. The phosphate transporter family is sufficiently specific to rule 
out any reasonable possibility that NAPTR would not also be useful like the other members of the 
family. 

Because the Examiner has not presented any evidence that the class of phosphate transporters 
has any, let alone a substantial number, of useless members, the Examiner must conclude that there is a 
"substantial likelihood" that the NAPTR encoded by the claimed polynucleotides is useful. It follows 
that the SEQ ID NO:2 polynucleotide also is useful. 

Even if the Examiner's "common utility" criterion were correct - and it is not - the phosphate 
transporter family would meet it. It is undisputed that known members of the phosphate transporter 
family are proteins involved in the regulation of intracellular phosphate levels. A person of ordinary skill 
in the art need not know any more about how the claimed invention participates in the regulation of 
intracellular phosphate levels to use it, and the Examiner presents no evidence to the contrary. Instead, 
the Examiner makes the conclusory observation that a person of ordinary skill in the art would need to 
know whether, for example, any given phosphate transporter carries out a particular role in the 
regulation of intracellular phosphate levels. The Examiner then goes on to assume that the only use for
NAPTR absent knowledge as to how the phosphate transporter actually works is further study of 
NAPTR itself. 

Not so. As demonstrated by Appellants, knowledge that NAPTR is a phosphate transporter is 
more than sufficient to make it useful for the diagnosis and treatment of disorders associated with 
increased or decreased phosphate levels. Indeed, NAPTR has been shown to be expressed in human 
brain tumor tissues. The Examiner must accept these facts to be true unless the Examiner can provide 
evidence or sound scientific reasoning to the contrary. But the Examiner has not done so. 

C. The uses of the claimed polynucleotides in toxicology testing, drug discovery,
and disease diagnosis are practical uses beyond mere study of the invention 
itself 

The Examiner's rejection of the claims at issue is tantamount to a rejection on the ground that 
the use of an invention as a tool for research is not a "substantial" use. Because the Examiner's 
rejection assumes a substantial overstatement of the law, and is incorrect in fact, it must be reversed. 

There is no authority for the proposition that use as a tool for research is not a substantial utility.
Indeed, the Patent Office itself has recognized that just because an invention is used in a research setting
does not mean that it lacks utility (M.P.E.P. § 2107.01):

Many research tools such as gas chromatographs, screening assays, and nucleotide 
sequencing techniques have a clear, specific and unquestionable utility (e.g., they are 
useful in analyzing compounds). An assessment that focuses on whether an invention is 
useful only in a research setting thus does not address whether the specific invention is 
in fact "useful" in a patent sense. Instead, Office personnel must distinguish between 
inventions that have a specifically identified substantial utility and inventions whose 
asserted utility requires further research to identify or reasonably confirm.

The Patent Office's actual practice has been, at least until the present, consistent with that approach. It
has routinely issued patents for inventions whose only use is to facilitate research, such as DNA ligases.
These are acknowledged by the Patent Office's Training Materials to be useful, as are polynucleotide 
sequences used, for example, as markers. 

The subset of research uses that are not "substantial" utilities is limited. It consists only of those 
uses in which the claimed invention is to be an object of further study, thus merely inviting further 
research on the invention itself. This follows from Brenner, in which the U.S. Supreme Court held that 
a process for making a compound does not confer a substantial benefit where the only known use of 
the compound was to be the object of further research to determine its use. Id. at 535. Similarly, in 
Kirk, the Court held that a compound would not confer substantial benefit on the public merely 
because it might be used to synthesize some other, unknown compound that would confer substantial 
benefit. Kirk, 376 F.2d at 940, 945. ("What appellants are really saying to those in the art is take 
these steroids, experiment, and find what use they do have as medicines.") Nowhere do those cases 
state or imply, however, that a material cannot be patentable if it has some other, additional beneficial
use in research. 

As used in toxicology testing, drug discovery, and disease diagnosis, the claimed invention has a 
beneficial use in research other than studying the claimed invention or its protein products. It is a tool, 
rather than an object, of research. The data generated in gene expression monitoring using the claimed 
invention as a tool is not used merely to study the claimed polynucleotide itself, but rather to study 
properties of tissues, cells, and potential drug candidates and toxins. Without the claimed invention, the 
information regarding the properties of tissues, cells, drug candidates and toxins is less complete 
(Bedilion Declaration at ¶ 15).

The use of the claimed invention as a research tool in toxicology testing is specific and 
substantial. While it is true that all polynucleotides expressed in humans have utility in toxicology testing 
based on the property of being expressed at some time in development or in the cell life cycle, this basis 
for utility does not preclude that utility from being specific and substantial. A toxicology test using any 
particular expressed polynucleotide is dependent on the identity of that polynucleotide, not on its 
biological function or its disease association. The results obtained from using any particular
human-expressed polynucleotide in toxicology testing are specific to both the compound being tested and the
polynucleotide used in the test. No two human-expressed polynucleotides are interchangeable 
for toxicology testing because the effects on the expression of any two such polynucleotides will differ 
depending on the identity of the compound tested and the identities of the two polynucleotides. It is 
not necessary to know the biological functions and disease associations of the polynucleotides in order 
to carry out such toxicology tests. Therefore, at the very least, the claimed polynucleotides are specific 
controls for toxicology tests in developing drugs targeted to other polynucleotides, and are clearly useful 
as such. 

As an example, any histone gene expressed in humans can be used in a specific and substantial 
toxicology test in drug development. A histone gene may not be suitable as a target for drug 
development because disruption of such a gene may kill a patient. However, a human-expressed 
histone gene is surely an excellent subject for toxicology studies when developing drugs targeted to 
other genes. A drug candidate which alters expression of a histone gene is toxic because disruption of
such a pervasively-expressed gene would have undesirable side effects in a patient. Therefore, when
testing the toxicology of a drug candidate targeted to another gene, measuring the expression of a 
histone gene is a good measure of the toxicity of that candidate, particularly in in vitro cellular assays at 
an early stage of drug development. The utility of any particular human-expressed histone gene in 
toxicology testing is specific and substantial because a toxicology test using that histone gene cannot be 
replaced by a toxicology test using a different gene, including any other histone gene. This specific and 
substantial utility requires no knowledge of the biological function or disease association of the histone 
gene. 

The Examiner contends that the claimed polynucleotides cannot be used in toxicology testing in 
the same way as can histone genes because "histone is ubiquitously expressed and is necessary for cell 
survival and function" whereas "the function of SEQ ID NO: 1 is uncharacterized and the physiological 
implications of altered expression of SEQ ID NO: 1-encoding polynucleotides is unknown" (Office
Action, March 21, 2003; page 27). However, the expression of SEQ ID NO: 1-encoding
polynucleotides in human tissues would lead a skilled artisan to believe that these polynucleotides have 
some physiological implications, even if these implications have not been precisely identified. During 
toxicology testing, a change in expression of a human-expressed polynucleotide indicates potential 
toxicity of a drug candidate, even if the polynucleotide is not absolutely necessary for cell survival and 
function, and even if the physiological implications of that polynucleotide are unknown. Such a 
toxicology test allows one to choose a lead drug candidate which has minimal effects on the expression 
of genes other than the gene to which the candidate is targeted. Such a lead drug candidate would be 
less likely to have unintended side effects than a drug candidate having greater effects on the expression 
of genes other than the intended drug target. Thus, the benefit of such a toxicology test is an increased 
chance of finding a safe and effective drug, and a corresponding reduction in the expense and time of 
bringing a drug to market. 
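
The selection logic in the preceding paragraph can be sketched in a few lines. This is an illustration under stated assumptions, not a procedure taken from the record: it assumes expression is summarized as one intensity per gene, scores each candidate by the mean absolute log2 fold change over every gene except the intended target, and picks the candidate with the lowest score. All gene labels, candidate names, and values are hypothetical.

```python
from math import log2


def off_target_score(control, treated, target_gene):
    """Mean absolute log2 fold change over all genes other than the target.

    `control` and `treated` map gene labels to expression intensities
    measured without and with the drug candidate (hypothetical inputs).
    """
    changes = [
        abs(log2(treated[gene] / control[gene]))
        for gene in control
        if gene != target_gene
    ]
    return sum(changes) / len(changes)


# Hypothetical readings for an untreated sample and two drug candidates.
control = {"TARGET": 100.0, "G1": 100.0, "G2": 100.0}
candidates = {
    "candidate_A": {"TARGET": 25.0, "G1": 100.0, "G2": 200.0},
    "candidate_B": {"TARGET": 25.0, "G1": 400.0, "G2": 50.0},
}

scores = {
    name: off_target_score(control, treated, "TARGET")
    for name, treated in candidates.items()
}
lead = min(scores, key=scores.get)
print(scores, lead)
```

Note that the score uses only the identity of each measured gene and its change in expression; no functional annotation of the non-target genes enters the calculation.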

The Examiner disputes whether the use of the claimed polynucleotides in toxicology testing is 
specific by stating that "any human polynucleotide could be used as a control in toxicology testing and 
thus this use would not be a specific utility" (Office Action, March 21, 2003; page 6; emphasis in 
original). The Examiner doesn't point to any law, however, that says a utility that is shared by a large 
class is somehow not a utility. If all of the class of expressed human polynucleotides can be used in
toxicology testing, then they all have utility. The issue is whether the claimed invention has any utility, 
not whether other compounds have a similar utility. Nothing in the law says that an invention must have 
a "unique" utility. Indeed, the whole notion of "well established" utilities presupposes that many 
different inventions can have the exact same utility. If the Examiner's argument were correct, there could
never be a well established utility, because you could always find a generic group with the same utility! 

The claimed invention has numerous additional uses as a research tool, each of which alone is a 
"substantial utility." These include diagnostic assays (Specification, e.g., at pages 32-33), chromosomal 
mapping (e.g., at pages 33-34), etc. 

D. The Patent Examiner failed to demonstrate that a person of ordinary skill in the 
art would reasonably doubt the utility of the claimed invention 

Based principally on citations to scientific literature identifying some of the difficulties involved in 
predicting protein function, the Examiner rejected the pending claims on the ground that the Appellants 
cannot impute utility to the claimed invention based on the 48% identity over 401 amino acid residues 
between the encoded polypeptide, NAPTR, and another polypeptide undisputed by the Examiner to be 
useful. The Examiner's rejection is incorrect both as a matter of fact and as a matter of procedural law.

As demonstrated in § II.C, supra, the literature cited by the Examiner is not inconsistent with 
the Appellants' proof of homology by a reasonable probability. It may show that Appellants cannot 
prove function by homology with certainty, but Appellants need not meet such a rigorous standard of 
proof. Under the applicable law, once the Appellants demonstrate a prima facie case of homology, 
the Examiner must accept the assertion of utility to be true unless the Examiner comes forward with 
evidence showing a person of ordinary skill would doubt the asserted utility could be achieved by a 
reasonable probability. See In re Brana, 51 F.3d at 1566; In re Langer, 503 F.2d 1380, 1391-92,
183 USPQ 288 (CCPA 1974). The Examiner has not made such a showing and, as such, the 
Examiner's rejection should be overturned. 

In the present case, the Examiner contends that the degree of amino acid identity among 
NAPTR and other phosphate transporter proteins is insufficient to establish that NAPTR is a member 
of the phosphate transporter family and thus shares the same utilities. The Examiner attempted to
support this assertion with the teachings of van de Loo et al. (Proc. Natl. Acad. Sci. USA, 1995,
92:6743-6747), Seffernick et al. (J. Bacteriol., 2001, 183:2405-2410), Broun et al. (Science, 1998,
282:1315-1317), Bork (Genome Res., 2000, 10:398-400), Scott et al. (Nat. Genet., 1999,
21:440-443), Vrljic et al. (J. Mol. Microbiol. Biotechnol., 1999, 1:327-336), Tenenhouse et al. (Am. J.
Physiol., 1998, 275:F527-F534), Murzin et al. (J. Mol. Biol., 1995, 247:536-540), and Brenner et al.
(Trends Genet., 1999, 15:132-133), all of record and addressed below. However, all of these
references fail to support the outstanding rejections. 

In support of Appellants' use of amino acid sequence homology to reasonably predict the 
biological function of the polypeptide encoded by the claimed polynucleotides, Appellants provide the 
enclosed reference by Brenner et al. ("Assessing sequence comparison methods with reliable 
structurally identified distant evolutionary relationships," Proc. Natl. Acad. Sci. USA, 1998, 95:6073- 
6078). Through exhaustive analysis of a dataset of proteins with known structural and functional 
relationships and with <90% overall sequence identity, Brenner et al. (1998) have determined that 40% 
identity is a reliable threshold for establishing evolutionary homology between two sequences aligned 
over at least 70 residues, and that 30% identity is a reliable threshold between two sequences aligned 
over at least 150 residues (Brenner et al., page 6076). Therefore, the 48% sequence identity between 
SEQ ID NO: 1 and the human renal sodium phosphate transport protein NPT1, over 401 amino acid 
residues, exceeds the thresholds proposed by Brenner et al., and SEQ ID NO: 1 is a true phosphate 
transporter protein by these criteria. Since these criteria are based on a dataset of homologous proteins 
with shared structural and functional features, one of ordinary skill in the art would likewise expect SEQ 
ID NO: 1 to possess the evolutionarily conserved structural and functional characteristics of the NPT1 
protein. Hence, the "reasonable correlation" standard as set by case law has been met. 
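
For illustration only, the two thresholds attributed above to Brenner et al. (1998) can be written as a simple check. This is a minimal sketch and not part of the record: the function name and parameters are hypothetical, and the cutoffs are only those recited in the preceding paragraph (40% identity over at least 70 aligned residues, or 30% identity over at least 150 aligned residues).

```python
def meets_brenner_threshold(percent_identity: float, aligned_length: int) -> bool:
    """Return True if an alignment clears either cutoff recited above.

    Hypothetical helper for illustration; the cutoffs are taken from the
    brief's summary of Brenner et al. (1998), not from any library.
    """
    # 30% identity is treated as reliable over at least 150 aligned residues.
    if aligned_length >= 150 and percent_identity >= 30.0:
        return True
    # 40% identity is treated as reliable over at least 70 aligned residues.
    if aligned_length >= 70 and percent_identity >= 40.0:
        return True
    return False


# The alignment discussed above: 48% identity over 401 residues.
print(meets_brenner_threshold(48.0, 401))  # prints True
```

Applied to the figures stated above (48% identity over 401 residues), the check is satisfied under either cutoff.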

Contrary to the assertions of the Examiner, the use of such sequence comparisons to predict 
protein function is supported by the Bork reference, cited by the Examiner. The Bork reference 
discloses a 70% accuracy rate in bioinformatics-based predictions. This more than meets the legal 
standard of utility, which requires only that one of skill in the art would more likely than not believe 
the utility of the claimed invention. For predicting functional features by homology, Table 1 of Bork 
discloses a 90% accuracy rate, even greater than the 70% rate for all bioinformatics predictions. 

The Examiner criticizes the use of the Bork reference, stating that "the evidence provided is a 
crude estimate and that 'numbers in Table 1 are often overestimates because the test sets used are 
usually not representative of all sequences' " (Office Action, March 21, 2003; pages 30-31). Even if 
the figures in Table 1 of Bork are "crude estimates" and "overestimates," they are nevertheless Bork's 
best estimates for judging the accuracy of bioinformatics analyses. Absent any other estimates, one of 
skill in the art would accept the estimates of Bork as reasonable. In addition, even if the test sets used 
are usually not representative of all sequences, the data compiled by Bork were still the best test sets 
available in the literature, and are indicative of the state of the art at that time. Moreover, in spite of the 
caveats cited by the Examiner, Bork nevertheless states that "there is still no doubt that sequence 
analysis is extremely powerful" (Bork, 2000, page 400, second column, 2nd paragraph). Therefore, 
the Bork reference supports the notion that a skilled artisan would consider functional annotation by 
sequence homology to more likely than not be accurate. 

The Examiner has cited van de Loo et al. and Seffernick et al. as evidence that "homologous 
proteins having significant sequence homology may exhibit different functions" (Office Action, 
September 20, 2002; page 4). van de Loo et al. describe the cloning of a fatty acyl hydroxylase based 
on sequence homology between certain fatty acyl hydroxylases and fatty acyl desaturases. In this 
example, the authors characterize fatty acyl hydroxylases and fatty acyl desaturases as catalyzing similar 
reactions (e.g., van de Loo et al., page 6743, right column, last paragraph), and conclude that the 
reaction mechanisms of oleate 12-hydroxylase and oleate desaturase are similar based on the sequence 
homology between them (e.g., van de Loo et al., abstract). In addition, Broun et al. characterize oleate 
desaturases and oleate hydroxylases as being "members of a large family of functionally diverse 
enzymes" (at page 1315, first column). Since the functions of the proteins described by van de Loo et 
al. are similar, and since these proteins belong to the same family, it is not surprising that they share 
67% sequence homology. In fact, this 67% sequence homology is an accurate indicator that these two 
proteins belong to the same family, further supporting the use of sequence homology to predict protein 
function. 

Similarly, Seffernick et al. describe a melamine deaminase and an atrazine chlorohydrolase that 
share 98% sequence identity and yet have different substrate specificities. These two enzymes both 
belong to the amidohydrolase enzyme superfamily whose members catalyze the hydrolytic displacement 
of amino groups or chlorine substituents from triazine ring compounds (e.g., Seffernick et al., page 
2409, right column, second paragraph). Notably, there is at least one member of the amidohydrolase 
superfamily that catalyzes both deamination and dechlorination reactions with triazine ring substrates 
(Id.). Therefore, the 98% sequence homology between melamine deaminase and atrazine 
chlorohydrolase correctly predicts their functional similarity and their membership in a common enzyme 
family. 

These examples in which it is difficult to obtain a precise functional prediction do not contradict 
the findings of Bork that, in the majority of cases, protein function is accurately predicted by sequence 
homology methods. In each of these examples, sequence homology methods correctly assign proteins 
to particular enzyme families whose members share similar enzyme activities. Thus, van de Loo et al. 
and Seffernick et al. do not provide any evidence that one of skill in the art would more likely than 
not doubt that NAPTR possesses the utilities of the NPT1 phosphate transporter. 

Seffernick et al. recognize that in "current genome annotation efforts . . . functional assignments 
based on >50% sequence identity are considered to be reasonably sound" (Seffernick et al., page 
2409, left column, paragraph 2). These authors state that their finding that "proteins with >98% 
sequence identity catalyze different reactions in different metabolic pathways is highly exceptional" 
(Id.; emphasis added). Thus, while there may be a number of examples in which the assignment of 
function by sequence homology is not perfectly accurate, these examples do not contradict the findings 
of Bork that, in general, sequence homology is an accurate method for assigning biological function. 

The Examiner states that "Seffernick et al. teach their result of identifying two proteins with 
>98% identity and having distinct functions 'underlies current genome annotation efforts where 
functional assignments based on >50% sequence identity are considered to reasonably sound' " (Office 
Action, March 21, 2003; page 31; emphasis added). This statement does not support "the examiner's 
argument that sequence identity is not predictive of function." By presenting evidence "underlying" the 
reasonable soundness of functional assignments based on >50% sequence identity, Seffernick et al. 
support the argument that sequence identity above a certain threshold is predictive of function. 

In further support of the rejection, the Examiner has cited Bork as evidence that "predicting the 
function of a polypeptide encoded by a specific gene, by sequence database searches has a 
considerable error rate" (Office Action, September 20, 2002; page 4). However, this does not negate 
the fact that there is a 90% accuracy rate for the prediction of functional features by homology, as 
disclosed by Bork. At most the Examiner shows that errors can occur in functional assignment. The 
Bork reference does not show that errors do not occur, but it does quantify the error rate at about 
10%. 

The Examiner cites Scott et al. as further evidence that functional prediction by sequence 
homology is not reliable. Scott et al. describe a single example in which sequence homology was only 
partially successful in predicting the protein function of pendrin. In this example, pendrin was correctly 
identified as an anion transporter by sequence homology. However, the assignment of sulfate as a 
substrate for pendrin was later found to be incorrect. This one single example of a partially-incorrect 
functional prediction does not contradict the findings of Bork that, in the majority of cases, protein 
function is accurately predicted by sequence homology methods. Thus, Scott et al. do not provide 
any evidence that one of skill in the art would more likely than not doubt that NAPTR possesses the 
utilities of phosphate transporters. 

The Examiner questions the characterization of the SEQ ID NO: 1 by methods other than 
sequence analysis. For example, the Examiner states that "while NAPTR (SEQ ID NO: 1), NPT1, and 
rat brain-specific sodium-dependent inorganic phosphate cotransporter may all share a potential N- 
glycosylation site, an ordinarily skilled artisan would recognize that nearly all full-length proteins exhibit a 
potential N-glycosylation site and therefore, would not be a factor in a determination of whether 
NAPTR (SEQ ID NO: 1) and NPT1 share the same function" (Office Action, March 21, 2003; page 
19; emphasis in original). The Examiner misunderstands the significance of the shared N-glycosylation 
site of the three proteins. It is both the presence and the location of the shared N-glycosylation site that 
makes it significant in characterizing the SEQ ID NO: 1 polypeptide. The conservation of the location 
of a single N-glycosylation site among the three proteins confirms the finding that NAPTR possesses 
the utilities of the phosphate transporters NPT1 and rat brain-specific sodium-dependent inorganic 
phosphate cotransporter. 

Furthermore, the Examiner asserts that "while NAPTR (SEQ ID NO: 1), NPT1, and rat 
brain-specific sodium-dependent inorganic phosphate cotransporter may have rather similar hydrophobicity 
plots, such plots have been shown to be similar even among proteins with relatively low sequence 
homology that exhibit different functions. For example, Vrljic et al. (J Mol Microbiol Biotechnol 
1:327-336) analyze the hydrophobicities of three proteins that function in the transport of different 
molecules (page 329, Figure 2) revealing strikingly similar hydrophobicity plots" (Office Action, March 
21, 2003; page 19). However, the fact that the proteins of Vrljic et al. are all transporter proteins 
which function in solute export is evidence that similar hydrophobicity plots can be a useful indicator of 
similar biological function. Therefore, Vrljic et al. support the use of similar hydrophobicity plots to 
confirm the findings from sequence homology that NAPTR is likely to possess the functions of 
phosphate transporter proteins. 

The Examiner cites Tenenhouse et al. as evidence that "one of skill in the art would recognize 
that polypeptides with similar functions do not necessarily have similar utilities" (Office Action, March 
21, 2003; page 19). The Examiner's argument is based on the differing expression of the NPT1 and 
NPT2 phosphate transporter proteins. However, the Examiner ignores the fact that both of these 
proteins have utility in transporting phosphate. This utility is based on the phosphate transport function 
of these proteins; the fact that these proteins may be expressed differently in different tissues would not 
prevent these proteins from having utility as phosphate transporters. 

The Examiner criticizes the use of the Brenner et al. (1998) reference by citing Murzin et al., an 
article which describes the SCOP database of proteins. However, the Examiner's criticisms are inapt. 
The use of the SCOP database of Murzin et al. by Brenner et al. (1998) makes the general rules 
obtained by Brenner et al. (1998) more reliable. As the Examiner recognizes, "the proteins within the 
SCOP database have been fully characterized - functionally by empirical laboratory experiments and 
structurally by generating a three-dimensional structure of the proteins" (Office Action, March 21, 
2003; page 20; emphasis in original). Since the reference database used by Brenner et al. (1998) 
contains only proteins which have been fully characterized, any general findings are robust and reliable. 

The Examiner asserts that "the function of NAPTR (SEQ ID NO: 1) has not been empirically 
determined nor has the three-dimensional structure been solved for comparison with NPT1" (Office 
Action, March 21, 2003; page 20). However, the rules derived by Brenner et al. (1998) are ideal for 
addressing a situation such as this. Brenner et al. (1998) used the robust SCOP database to derive 
general rules for identifying homology between proteins, which could then be applied to any other 
proteins which had not been fully characterized. If it were necessary to have fully characterized a 
protein before it was analyzed using the standards derived by Brenner et al. (1998), then there would 
be no need for such analysis because the protein would already have been characterized! Thus, there is 
no need for an empirical determination of function or an actual determination of three-dimensional 
structure to use the general rules of Brenner et al. (1998). 

In addition, the Examiner incorrectly asserts that "[t]he function of NAPTR has been assigned 
solely on the basis of a relatively low sequence identity to NPT1" (Office Action, March 21, 2003; 
page 20). With this statement, the Examiner completely ignores the findings of Brenner et al. (1998) 
that at least 30% identity over at least 150 amino acid residues, or at least 40% identity over at least 70 
amino acid residues, is a reliable indicator of homology. Instead, the Examiner would substitute the 
opinion that 48% identity over 401 amino acid residues is "low sequence identity" and is inadequate to 
reasonably assign homology between NAPTR and NPT1. No evidence has been presented to support 
the Examiner's assertion that 48% identity over 401 amino acids is "low sequence identity." 
Furthermore, the Examiner is incorrect in asserting that the function of NAPTR has been assigned 
based "solely" on sequence identity between NAPTR and NPT1. The Examiner ignores the fact that 
these proteins share a potential N-glycosylation site at the same location (Specification, e.g., at page 
10, line 32 to page 11, line 2; and Figures 2A and 2B) and have similar hydrophobicity plots (e.g., at 
page 11, lines 2-4; and Figures 3A, 3B and 3C). 

The Examiner cites Brenner et al. (1999) as further evidence that "laboratory experiments are 
required to verify a protein's function" (Office Action, March 21, 2003; page 20). The question at 
hand, however, is not whether biological function can be predicted with certainty based solely on 
sequence homology. The question is whether the claimed invention meets the statutory requirements for 
utility under 35 U.S.C. § 101. The Examiner is correct in stating that " '[w]ithout laboratory 
experiments to verify the computational methods and their expert analysis, it is impossible to know for 
certain [whether the function assigned to a protein by annotation is correct]' " (Office Action, March 
21, 2003; page 20). By this statement, the Examiner would seem to be applying a standard that calls 
for a skilled artisan to be absolutely certain of the asserted utility, with no margin for error. However, 
the applicable standard is that one of ordinary skill in the art would more likely than not believe the 
asserted utility. Appellants have met this standard with respect to the claimed invention, and have 
therefore met the utility requirement of 35 U.S.C. § 101. 

The references cited by the Examiner show that there may be difficulties and errors involved in 
predicting protein function by homology. However, these references do not contradict the fact that 
such methods are accurate more often than not. The Examiner has only provided isolated examples in 
which mutations can sometimes result in a shift of the biological activity of a naturally occurring 
polypeptide to a related biological activity found in other members of the polypeptide family. Although 
the Examiner insists that the cited references are representative and are not isolated examples, the 
Examiner has not presented any evidence that this is so. In contrast, the references of Brenner et al. 
(1998) and Bork present findings which are generally applicable to the accuracy and reliability of 
sequence analysis methods because these references have compiled the results of many such 
experiments. As such, one of skill in the art would more likely than not believe that NAPTR has the 
utilities of the family of phosphate transporters. 

As the cited evidence is completely insufficient to support the rejections of the claims, the 
outstanding rejections must be reversed for this reason alone. The only relevant evidence of record 
shows that a person of ordinary skill in the art would not doubt that the polypeptide encoded by the 
claimed polynucleotides is in fact a member of the family of phosphate transporter proteins, which are 
known to have specific utility. 

IV. By Requiring the Appellants to Assert a Particular or Unique Utility, the Patent 
Examination Utility Guidelines and Training Materials Applied by the Patent 
Examiner Misstate the Law 

There is an additional, independent reason to overturn the rejections: to the extent the 
rejections are based on Revised Interim Utility Examination Guidelines (64 FR 71427, December 21, 
1999), the final Utility Examination Guidelines (66 FR 1092, January 5, 2001) and/or the Revised 
Interim Utility Guidelines Training Materials (USPTO Website www.uspto.gov, March 1, 2000), the 
Guidelines and Training Materials are themselves inconsistent with the law. 

The Training Materials, which direct the Examiners regarding how to apply the Utility 
Guidelines, address the issue of specificity with reference to two kinds of asserted utilities: "specific" 
utilities, which meet the statutory requirements, and "general" utilities, which do not. The Training 
Materials define a "specific utility" as follows: 

A [specific utility] is specific to the subject matter claimed. This contrasts to general 
utility that would be applicable to the broad class of invention. For example, a claim to 
a polynucleotide whose use is disclosed simply as "gene probe" or "chromosome 
marker" would not be considered to be specific in the absence of a disclosure of a 
specific DNA target. Similarly, a general statement of diagnostic utility, such as 
diagnosing an unspecified disease, would ordinarily be insufficient absent a disclosure of 
what condition can be diagnosed. 

The Training Materials distinguish between "specific" and "general" utilities by assessing 
whether the asserted utility is sufficiently "particular," i.e., unique (Training Materials at page 52) as 
compared to the "broad class of invention." (In this regard, the Training Materials appear to parallel 
the view set forth in Stephen G. Kunin, Written Description Guidelines and Utility Guidelines, 82 
J.P.T.O.S. 77, 97 (Feb. 2000) ("With regard to the issue of specific utility the question to ask is 
whether or not a utility set forth in the specification is particular to the claimed invention.").) 

Such "unique" or "particular" utilities never have been required by the law. To meet the utility 
requirement, the invention need only be "practically useful," Natta, 480 F.2d at 1397, and confer a 
"specific benefit" on the public. Brenner, 383 U.S. at 534. Thus incredible "throwaway" utilities, such 
as trying to "patent a transgenic mouse by saying it makes great snake food," do not meet this standard. 
Karen Hall, Genomic Warfare, The American Lawyer 68 (June 2000) (quoting John Doll, Chief of the 
Biotech Section of USPTO). 

This does not preclude, however, a general utility, contrary to the statement in the Training 
Materials where "specific utility" is defined (page 5). Practical real-world uses are not limited to uses 
that are unique to an invention. The law requires that the practical utility be "definite," not particular. 
Montedison, 664 F.2d at 375. Appellants are not aware of any court that has rejected an assertion of 
utility on the grounds that it is not "particular" or "unique" to the specific invention. Where courts have 
found utility to be too "general," it has been in those cases in which the asserted utility in the patent 
disclosure was not a practical use that conferred a specific benefit. That is, a person of ordinary skill in 
the art would have been left to guess as to how to benefit at all from the invention. In Kirk, for 
example, the CCPA held the assertion that a man-made steroid had "useful biological activity" was 
insufficient where there was no information in the specification as to how that biological activity could be 
practically used. Kirk, 376 F.2d at 941. 

The fact that an invention can have a particular use does not provide a basis for requiring a 
particular use. See Brana, supra (disclosure describing a claimed antitumor compound as being 
homologous to an antitumor compound having activity against a "particular" type of cancer was 
determined to satisfy the specificity requirement). "Particularity" is not and never has been the sine qua 
non of utility; it is, at most, one of many factors to be considered. 

As described supra, broad classes of inventions can satisfy the utility requirement so long as a 
person of ordinary skill in the art would understand how to achieve a practical benefit from knowledge 
of the class. Only classes that encompass a significant portion of nonuseful members would fail to meet 
the utility requirement. Supra § III.B (Montedison, 664 F.2d at 374-375). 

The Training Materials fail to distinguish between broad classes that convey information of 
practical utility and those that do not, lumping all of them into the latter, unpatentable category of 
"general" utilities. As a result, the Training Materials paint with too broad a brush. Rigorously applied, 
they would render unpatentable whole categories of inventions heretofore considered to be patentable, 
and that have indisputably benefitted the public, including the claimed invention. See supra § III.B. 
Thus the Training Materials cannot be applied consistently with the law. 

Issue 2 - Whether claims 3, 6, 7, 9, 12, 13, 46, 48, 57, and 58 meet the written description 
requirement of 35 U.S.C. § 112, first paragraph 

Claims 3, 6, 7, 9, 12, 13, 46, 48, 57, and 58 stand rejected under 35 U.S.C. § 112, first 
paragraph, based on the allegation that the specification does not describe the subject matter in such a 
way as to reasonably convey to one of skill in the art that the inventors, at the time the application was 
filed, had possession of the claimed invention. The Examiner asserts that "[t]he specification provides 
only a single representative species of the claimed genus of nucleic acids, i.e., the nucleic acid of 
SEQ ID NO:2 encoding a polypeptide asserted as having phosphate transport activity" (Office Action, 
March 21, 2003; page 34; emphasis in original). This rejection is traversed. 

The requirements necessary to fulfill the written description requirement of 35 U.S.C. § 112, 
first paragraph, are well established by case law. 

... the applicant must also convey with reasonable clarity to those skilled in the art that, 
as of the filing date sought, he or she was in possession of the invention. The invention 
is, for purposes of the "written description" inquiry, whatever is now claimed. 
Vas-Cath, Inc. v. Mahurkar, 19 USPQ2d 1111, 1117 (Fed. Cir. 1991) 

Attention is also drawn to the Patent and Trademark Office's own "Guidelines for Examination 
of Patent Applications Under the 35 U.S.C. Sec. 112, para. 1", published January 5, 2001, which 
provide that: 

An applicant may also show that an invention is complete by disclosure of sufficiently 
detailed, relevant identifying characteristics which provide evidence that applicant was 
in possession of the claimed invention, i.e., complete or partial structure, other physical 
and/or chemical properties, functional characteristics when coupled with a known or 
disclosed correlation between function and structure, or some combination of such 
characteristics. What is conventional or well known to one of ordinary skill in the art 
need not be disclosed in detail. If a skilled artisan would have understood the inventor 
to be in possession of the claimed invention at the time of filing, even if every nuance of 
the claims is not explicitly described in the specification, then the adequate description 
requirement is met. [footnotes omitted] 

Thus, the written description standard is fulfilled by both what is specifically disclosed and what 
is conventional or well known to one skilled in the art. 



A. The specification provides an adequate written description of the claimed "variants" and 
"fragments" of SEQ ID NO:1 and SEQ ID NO:2. 

The subject matter encompassed by claims 3, 6, 7, 9, 12, 13, 46, 48, 57, and 58 is either 
disclosed by the specification or is conventional or well known to one skilled in the art. 

First note that the "variant" language of independent claim 3 recites a polynucleotide encoding a 
polypeptide "comprising a naturally-occurring amino acid sequence at least 90% identical to the amino 
acid sequence of SEQ ID NO: 1," and the "variant" language of independent claim 12 recites "a 
polynucleotide comprising a naturally-occurring polynucleotide sequence at least 90% identical to the 
polynucleotide sequence of SEQ ID NO:2." Furthermore, the "fragment" language of independent 
claim 3 recites a polynucleotide encoding a "fragment of a polypeptide having the amino acid sequence 
of SEQ ID NO: 1, wherein said fragment transports phosphate," and the "fragment" language of 
independent claim 13 recites a polynucleotide comprising at least 20 contiguous nucleotides of "a 
polynucleotide consisting of nucleotides 1183 through 1454 of the polynucleotide sequence of SEQ ID 
NO:2." 

The amino acid sequence of SEQ ID NO: 1 and the polynucleotide sequence of SEQ ID NO:2 
are explicitly disclosed in the specification. See, for example, the Sequence Listing and Figures 1A, 
1B, and 1C. Variants of SEQ ID NO: 1 and SEQ ID NO:2 are described in the Specification at, for 
example, page 3, lines 5-7; page 4, lines 29-32; page 5, lines 5-8 and 15-23; page 9, lines 28-30; 
page 11, lines 5-8 and 14-21; page 12, lines 3-4 and 11-30; and page 14, line 22 to page 15, line 12; 
and fragments of SEQ ID NO: 1 and SEQ ID NO:2 are described at, for example, page 3, lines 8-11; 
page 4, lines 23-28; page 8, lines 21-25; page 11, lines 32-33; page 14, lines 19-21; page 20, lines 
10-13; page 23, lines 23-29; page 40, lines 8-10; page 41, lines 2-5; and page 42, lines 7-10. The 
portion of SEQ ID NO:2 consisting of nucleotides 1183 through 1454 corresponds to the 
polynucleotide disclosed as Incyte Clone 754412.est (SEQ ID NO:5) at, for example, page 38, lines 
11-12 and 24-25. In addition, a specific assay to measure phosphate transport is disclosed in the 
Specification at, for example, page 41, lines 21-30. 

One of ordinary skill in the art would recognize polynucleotide sequences which are variants 
having a polynucleotide sequence at least 90% or 95% identical to SEQ ID NO:2, or which encode 
polypeptide variants having an amino acid sequence at least 90% identical to SEQ ID NO: 1. Given 
any naturally occurring polynucleotide sequence, it would be routine for one of skill in the art to 
recognize whether it was a variant of SEQ ID NO:2, and whether it encoded a variant of SEQ ID 
NO: 1. Accordingly, the specification provides an adequate written description of the recited 
polynucleotide variants of SEQ ID NO:2 and polynucleotides encoding the recited polypeptide variants 
of SEQ ID NO: 1. 
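
By way of illustration, the "at least 90% identical" comparison described above can be sketched as follows. This is a hedged example and not part of the specification: it assumes the two sequences have already been aligned to equal length (with "-" marking gaps) and that percent identity is computed over all alignment columns, which is only one of several conventions. The function names and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both sequences.

    Assumes pre-aligned, equal-length inputs with '-' for gaps. The
    denominator (all columns) is an illustrative convention only.
    """
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("inputs must be non-empty and of equal aligned length")
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"  # a gap paired with a gap is not a match
    )
    return 100.0 * matches / len(aligned_a)


def is_variant(aligned_a: str, aligned_b: str, cutoff: float = 90.0) -> bool:
    """True if the aligned sequences are at least `cutoff` percent identical."""
    return percent_identity(aligned_a, aligned_b) >= cutoff


# Toy 10-column alignment differing at one position (9 of 10 columns match).
print(is_variant("ACGTACGTAC", "ACGTACGTAA"))
```

In practice the alignment itself would be produced by a standard alignment program; the sketch only shows that, once an alignment exists, the threshold comparison is mechanical.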

One of ordinary skill in the art would recognize polynucleotide sequences which are fragments 
comprising at least 20 contiguous nucleotides of the portion of SEQ ID NO:2 consisting of nucleotides 
1183 through 1454 (i.e., corresponding to Incyte Clone 754412.est (SEQ ID NO:5)), or which 
encode polypeptide sequences which are fragments of SEQ ID NO: 1. The information provided by 
SEQ ID NO: 1 and SEQ ID NO:2 provides the necessary framework for the recited fragments — to 
recite every possible fragment would needlessly clutter the application. Furthermore, it would be 
routine for one of skill in the art to determine whether any particular fragment of SEQ ID NO: 1 had 
phosphate transport activity, using the disclosed phosphate transport assay. Accordingly, the 
specification provides an adequate written description of the recited polynucleotide fragments of SEQ 
ID NO:2, and polynucleotides encoding the recited fragments of SEQ ID NO: 1. 
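
As a further illustration, whether a candidate sequence comprises a run of contiguous nucleotides drawn from a defined region of a reference sequence can be checked mechanically. The sketch below is an assumption-laden example, not the claimed method: the function name is hypothetical, positions are 1-based and inclusive, and the short toy sequences and the minimum length of 6 are stand-ins for SEQ ID NO:2, nucleotides 1183 through 1454, and the recited minimum of 20.

```python
def shares_contiguous_run(candidate: str, reference: str,
                          start: int, end: int, min_len: int = 20) -> bool:
    """True if `candidate` contains at least `min_len` contiguous nucleotides
    that also occur contiguously within reference[start..end] (1-based, inclusive).
    """
    region = reference[start - 1:end].upper()
    candidate = candidate.upper()
    if len(candidate) < min_len or len(region) < min_len:
        return False
    # Slide a window of min_len across the candidate and look for it in the region.
    return any(
        candidate[i:i + min_len] in region
        for i in range(len(candidate) - min_len + 1)
    )


# Toy data: the region of interest is positions 6-15 of a 20-nucleotide reference.
toy_reference = "AAAAACCCCCGGGGGTTTTT"
print(shares_contiguous_run("TTCCGGGGAA", toy_reference, 6, 15, min_len=6))  # prints True
print(shares_contiguous_run("AAAAACCC", toy_reference, 6, 15, min_len=6))    # prints False
```

The second call returns False because none of its six-nucleotide windows falls wholly within the designated region, even though the candidate overlaps the reference elsewhere.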

1. The present claims specifically define the claimed genus through the recitation of 
chemical structure 

Court cases in which "DNA claims" have been at issue (which are hence relevant to claims to 
proteins encoded by the DNA) commonly emphasize that the recitation of structural features or 
chemical or physical properties is an important factor to consider in a written description analysis of 
such claims. For example, in Fiers v. Revel, 25 USPQ2d 1601, 1606 (Fed. Cir. 1993), the court 
stated that: 

If a conception of a DNA requires a precise definition, such as by structure, formula, 
chemical name or physical properties, as we have held, then a description also requires 
that degree of specificity. 

In a number of instances in which claims to DNA have been found invalid, the courts have 
noted that the claims attempted to define the claimed DNA in terms of functional characteristics without 
any reference to structural features. As set forth by the court in University of California v. Eli Lilly 
and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 1997): 

In claims to genetic material, however, a generic statement such as "vertebrate insulin 
cDNA" or "mammalian insulin cDNA," without more, is not an adequate written 
description of the genus because it does not distinguish the claimed genus from others, 
except by function. 

Thus, the mere recitation of functional characteristics of a DNA, without the definition of 
structural features, has been a common basis by which courts have found invalid claims to DNA. For 
example, in Lilly, 43 USPQ2d at 1407, the court found invalid for violation of the written description 
requirement the following claim of U.S. Patent No. 4,652,525: 

1. A recombinant plasmid replicable in procaryotic host containing within its nucleotide 
sequence a subsequence having the structure of the reverse transcript of an mRNA of a 
vertebrate, which mRNA encodes insulin. 

In Fiers, 25 USPQ2d at 1603, the parties were in an interference involving the following count: 

A DNA which consists essentially of a DNA which codes for a human fibroblast 
interferon-beta polypeptide. 

Party Revel in the Fiers case argued that its foreign priority application contained an adequate 
written description of the DNA of the count because that application mentioned a potential method for 
isolating the DNA. The Revel priority application, however, did not have a description of any particular 
DNA structure corresponding to the DNA of the count. The court therefore found that the Revel 
priority application lacked an adequate written description of the subject matter of the count. 

Thus, in Lilly and Fiers, nucleic acids were defined on the basis of functional characteristics 
and were found not to comply with the written description requirement of 35 U.S.C. § 112; i.e., "an 
mRNA of a vertebrate, which mRNA encodes insulin" in Lilly, and "DNA which codes for a human 
fibroblast interferon-beta polypeptide" in Fiers. In contrast to the situation in Lilly and Fiers, the 
claims at issue in the present application define polynucleotides and polypeptides in terms of chemical 
structure, rather than functional characteristics. For example, the language of independent claims 3 and 
12 recites chemical structure to define the claimed genus: 

3. An isolated polynucleotide encoding a polypeptide selected from the group 
consisting of: 

a) a polypeptide comprising the amino acid sequence of SEQ ID NO: 1, 

b) a polypeptide comprising a naturally-occurring amino acid sequence at least 90%
identical to the amino acid sequence of SEQ ID NO: 1, and

c) a fragment of a polypeptide having the amino acid sequence of SEQ ID NO: 1,
wherein said fragment transports phosphate. 

12. An isolated polynucleotide selected from the group consisting of: 

a) a polynucleotide comprising the polynucleotide sequence of SEQ ID NO:2, 

b) a polynucleotide comprising a naturally-occurring polynucleotide sequence at
least 90% identical to the polynucleotide sequence of SEQ ID NO:2,

c) a polynucleotide completely complementary to a polynucleotide of a), 

d) a polynucleotide completely complementary to a polynucleotide of b), and 

e) an RNA equivalent of a)-d). 

From the above it should be apparent that the claims of the subject application are 
fundamentally different from those found invalid in Lilly and Fiers. The subject matter of the present 
claims is defined in terms of the chemical structure of SEQ ID NO: 1 and SEQ ID NO:2. In the present 
case, there is no reliance merely on a description of functional characteristics of the polynucleotides and 
polypeptides. The polynucleotides defined by the claims of the present application recite structural 
features, and cases such as Lilly and Fiers stress that the recitation of structure is an important factor to 
consider in a written description analysis of claims of this type. By failing to base the written description 
inquiry "on whatever is now claimed," the Examiner failed to provide an appropriate analysis of the 
present claims and how they differ from those found not to satisfy the written description requirement in 
Lilly and Fiers. 

The Patent Office Guidelines indicate that evidence that Appellants were in possession of the 
claimed invention can include "complete or partial structure, other physical and/or chemical properties, 
functional characteristics when coupled with a known or disclosed correlation between function and 
structure, or some combination of such characteristics" (P.T.O. Guidelines, supra; emphasis added). 
The claimed polynucleotides have been described by chemical structure (e.g., relation of the recited 
polynucleotides to SEQ ID NO:2, relation of the recited polypeptides to SEQ ID NO:1), physical
properties (e.g., occurrence in nature of the recited variant sequences), and chemical properties (e.g., 
phosphate transport activity of the recited polypeptide fragments). Therefore, the written description 
requirement has been met.

2. The present claims do not define a genus which is "highly variant"

Furthermore, the claims at issue do not describe a genus which could be characterized as 
"highly variant." Available evidence illustrates that, rather than being a large variable genus, the claimed 
genus is of narrow scope. 

In support of this assertion, the Examiner's attention is directed to the enclosed reference by 
Brenner et al. ("Assessing sequence comparison methods with reliable structurally identified distant 
evolutionary relationships," Proc. Natl. Acad. Sci. USA, 1998, 95:6073-6078). Through exhaustive 
analysis of a data set of proteins with known structural and functional relationships and with <90% 
overall sequence identity, Brenner et al. have determined that 30% identity is a reliable threshold for 
establishing evolutionary homology between two sequences aligned over at least 150 residues (Brenner 
et al., pages 6073 and 6076). Furthermore, local identity is particularly important in this case for 
assessing the significance of the alignments, as Brenner et al. further report that >40% identity over at 
least 70 residues is reliable in signifying homology between proteins (Brenner et al., page 6076). 

The present application is directed, inter alia, to polynucleotides encoding phosphate 
transporter proteins, including polynucleotides encoding phosphate transporter proteins related to the 
amino acid sequence of SEQ ID NO: 1. In accordance with Brenner et al., naturally occurring 
molecules may exist which could be characterized as phosphate transporter proteins and which have as 
little as 30% identity over at least 150 residues to SEQ ID NO: 1. The "variant language" of the present
claims recites a polynucleotide encoding a polypeptide comprising "a naturally-occurring amino acid 
sequence at least 90% identical to the amino acid sequence of SEQ ID NO: 1" (note that SEQ ID 
NO: 1 has 401 amino acid residues). This variation is far less than that of polynucleotides encoding all 
potential phosphate transporter proteins related to SEQ ID NO: 1, i.e., those phosphate transporter 
proteins having as little as 30% identity over at least 150 residues to SEQ ID NO: 1.
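
As an illustrative aid only, the arithmetic behind this comparison can be sketched as follows. The sketch assumes that percent identity is computed as identical positions divided by aligned positions, and that the 90% threshold is applied across all 401 residues of SEQ ID NO: 1; the function name max_differences is hypothetical and appears nowhere in the record.

```python
def max_differences(aligned_length: int, min_identity_percent: int) -> int:
    """Largest number of non-identical positions that still meets the threshold.

    Assumption: identity = identical positions / aligned positions.
    """
    # Smallest count of identical positions that satisfies the threshold
    # (integer ceiling division, to avoid floating-point rounding).
    min_identical = -(-aligned_length * min_identity_percent // 100)
    return aligned_length - min_identical


# Claimed variants: at least 90% identity to the 401-residue SEQ ID NO: 1.
claimed = max_differences(401, 90)

# Brenner et al. (1998) threshold: 30% identity over at least 150 residues.
brenner = max_differences(150, 30)

print(f"90% over 401 residues: at most {claimed} differing positions")
print(f"30% over 150 residues: at most {brenner} differing positions")
```

Under these assumptions, a sequence at least 90% identical to the 401-residue reference can differ at no more than 40 positions, whereas a 150-residue alignment at 30% identity can differ at up to 105 positions.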

The Examiner asserts that "applicants improperly attempt to apply the teachings of Brenner et 
al. (Proc Natl Acad Sci USA 95:6073-6078) to support their argument" (Office Action, March 21, 
2003; page 37), citing as evidence Brenner et al. (Trends Genet., 1999, 15:132-133), Scott et al. (Nat. 
Genet., 1999, 21:440-443), and Seffernick et al. (J. Bacteriol., 2001, 183:2405-2410). In particular,
the Examiner states that "Brenner (Trends in Genetics 15:132-133) teaches that it is impossible to 
know the accuracy of functional assignment without empirical laboratory evidence" (Office Action,
March 21, 2003; page 37). The Examiner appears to again be arguing that functional predictions are 
unreliable unless they can be made with absolute accuracy. As discussed above under Issue 1 (e.g., at 
§ III.D), although the cited references may show that functional predictions cannot be made with 100%
accuracy, they nevertheless demonstrate that a skilled artisan would believe that such predictions would 
be reasonably accurate. All that is required to satisfy the written description requirement of 35 U.S.C. 
§ 112, first paragraph, is that one of skill in the art would reasonably understand that Appellants were
in possession of the claimed invention at the time the application was filed. 

The Examiner argues that "Brenner et al. clearly does not suggest that all amino acid sequence 
with at least 30% identity over 150 amino acids to another amino acid sequence will share a similar 
function" (Office Action, March 21, 2003; page 37). The Examiner is correct. However, Brenner et 
al. do provide reasonable guidelines for judging homology. Even if these guidelines are not 100%
accurate, they would nevertheless lead one of skill in the art to understand that proteins meeting the 
thresholds of Brenner et al. would be reasonably likely to be homologous, and to share similar 
functions. 

Moreover, the Examiner's arguments do not address the degree of variation within the claimed 
genus of polynucleotides. While the Examiner argues that functional assignments based on sequence 
homology are not 100% accurate, these arguments do not preclude the use of the thresholds of Brenner 
et al. to denote a reasonable degree of variation within a genus of polynucleotides or polypeptides. 
Brenner et al. (1998) demonstrates that the claimed genus is not highly variant because the criteria used 
to define the structures of the members of the claimed genus (e.g., at least 90% or 95% identity in 
relation to a reference sequence such as SEQ ID NO: 1 or SEQ ID NO:2) are conservative relative to 
the broadest criteria which a skilled artisan would consider to be reasonable (e.g., the criteria of 
Brenner et al. (1998) that 30% identity over at least 150 residues, or 40% identity over at least 70 
residues, reasonably denotes homology). Thus, it is proper to use the findings of Brenner et al. (1998) 
to demonstrate that the claimed genus of polynucleotides is not highly variant.

3. The state of the art at the time of the present invention is further advanced than at
the time of the Lilly and Fiers applications 

In the Lilly case, claims of U.S. Patent No. 4,652,525 were found invalid for failing to comply 
with the written description requirement of 35 U.S.C. § 112. The '525 patent claimed the benefit of
priority of two applications, Application Serial No. 801,343 filed May 27, 1977, and Application Serial 
No. 805,023 filed June 9, 1977. In the Fiers case, party Revel claimed the benefit of priority of an 
Israeli application filed on November 21, 1979. Thus, the written description inquiry in those cases 
was based on the state of the art at essentially the "dark ages" of recombinant DNA technology. 

The present application has a priority date of February 24, 1997. Much has happened in the
development of recombinant DNA technology in the 20 or so years between the filing of the
applications involved in Lilly and Fiers and the filing of the present application. For example, the technique of
polymerase chain reaction (PCR) was invented. Highly efficient cloning and DNA sequencing 
technology has been developed. Large databases of protein and nucleotide sequences have been 
compiled. Much of the raw material of the human and other genomes has been sequenced. With these 
remarkable advances, one of skill in the art would recognize that, given the sequence information of 
SEQ ID NO: 1 and SEQ ID NO:2, and the additional extensive detail provided by the subject 
application, the present inventors were in possession of the claimed polynucleotide variants and 
fragments at the time of filing of this application. 

4. Summary 

The Examiner failed to base the written description inquiry "on whatever is now claimed." 
Consequently, the Examiner did not provide an appropriate analysis of the present claims and how they 
differ from those found not to satisfy the written description requirement in cases such as Lilly and 
Fiers. In particular, the claims of the subject application are fundamentally different from those found 
invalid in Lilly and Fiers. The subject matter of the present claims is defined in terms of the chemical 
structure of SEQ ID NO: 1 and SEQ ID NO:2. The courts have stressed that structural features are 
important factors to consider in a written description analysis of claims to nucleic acids and proteins. In 
addition, the genus of polynucleotides defined by the present claims is adequately described, as 
evidenced by Brenner et al. Furthermore, there have been remarkable advances in the state of the art
since the Lilly and Fiers cases, and these advances were given no consideration whatsoever in the 
position set forth by the Examiner. 

For at least the reasons set forth above, the specification provides an adequate written 
description of the claimed subject matter, and this rejection should be overturned. 

Issue 3 - Whether claims 3, 6, 7, 9, 12, 13, 46, 48, 57, and 58 meet the enablement 
requirement of 35 U.S.C. § 112, first paragraph

Claims 3, 6, 7, 9, 12, 13, 46, 48, 57, and 58 stand rejected under 35 U.S.C. § 112, first
paragraph, based on the allegation that the specification does not describe the subject matter of the 
invention in such a way as to enable one of skill in the art to make and/or use the claimed variants and 
fragments. In particular, the Examiner asserts that "the claimed nucleic acids and array would require 
undue experimentation for a skilled artisan to make and/or use" (Office Action, March 21, 2003; page 
47). Such, however, is not the case. 

The Examiner asserts that "[t]he claims encompass nucleic acids encoding variants and nucleic 
acids comprising fragments that have phosphate transport activity in addition to variant polypeptides 
that are non-functional or exhibit a function other than phosphate transport activity. While techniques
for isolation of nucleic acids encoding variants are known in the art, other than a method for screening 
those nucleic acids encoding polypeptides having phosphate transport activity, the specification 
provides no additional guidance in the form of assays for identifying those encoded proteins having 
activities other than phosphate transport" (Office Action, March 21, 2003; pages 39-40). With respect 
to the claimed variants, note that claim 3, for example, recites not only that the polynucleotides encode 
polypeptides which are at least 90% identical to SEQ ID NO:1, but also that they have "a naturally
occurring amino acid sequence." Through the process of natural selection, nature will have
determined the appropriate amino acid sequences. Given the information provided by SEQ ID NO: 1
(the amino acid sequence of NAPTR) and SEQ ID NO:2 (the polynucleotide sequence encoding
NAPTR), one of skill in the art would be able to routinely obtain "a naturally occurring amino acid
sequence at least 90% identical to the amino acid sequence of SEQ ID NO: 1." 

For example, the identification of relevant polynucleotides could be performed by hybridization 
and/or PCR techniques that were well-known to those skilled in the art at the time the subject 
application was filed and/or described throughout the Specification of the instant application. See, e.g., 
page 13, line 7 to page 14, line 8; page 33, lines 9-31; and Example VI at page 40. Thus, one skilled 
in the art need not make and test vast numbers of polynucleotides that encode polypeptides based on 
the amino acid sequence of SEQ ID NO: 1, or vast numbers of polynucleotides based on the 
polynucleotide sequence of SEQ ID NO:2. Instead, one skilled in the art need only screen a cDNA
library or use appropriate PCR conditions to identify relevant polynucleotides, and their encoded 
polypeptides, that already exist in nature. By adjusting the nature of the probes or nucleic acids (i.e., 
non-conserved, conserved, or highly conserved) and the conditions of hybridization (maximum, high, 
intermediate, or low stringency), one can obtain variant polynucleotides of SEQ ID NO:2 which, in 
turn, will allow one to make the variant polypeptides of SEQ ID NO: 1 recited by the present claims 
using conventional techniques of recombinant protein production. 

By extension, one of skill in the art could make fragments of naturally occurring polynucleotides at
least 90% identical to SEQ ID NO:2, and could use such fragments, for example, as hybridization 
probes to detect full-length naturally occurring polynucleotides at least 90% identical to SEQ ID NO:2. 
In addition, one of skill in the art would be able to routinely obtain probes completely complementary to 
at least 30 contiguous nucleotides of polynucleotides at least 90% identical to SEQ ID NO:2 
(Specification, e.g., at page 31, lines 19-26), and could make arrays comprising such probes and use 
them, for example, to detect full-length naturally occurring polynucleotides at least 90% identical to 
SEQ ID NO:2. 
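
For illustration only, the design of a probe completely complementary to 30 contiguous nucleotides of a target can be sketched as follows. The target string below is a placeholder and is not SEQ ID NO:2, and the function name complementary_probe is hypothetical; the sketch assumes a DNA target written 5' to 3' using only the letters A, C, G, and T.

```python
# Watson-Crick base pairing for DNA (assumes only A, C, G, T in the input).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complementary_probe(target: str, start: int, length: int = 30) -> str:
    """Return the reverse complement of target[start:start + length], 5' to 3'."""
    window = target[start:start + length].upper()
    if len(window) < length:
        raise ValueError("window extends past the end of the target")
    # Complement each base, then reverse so the probe also reads 5' to 3'.
    return window.translate(COMPLEMENT)[::-1]


# Placeholder sequence for illustration only; NOT SEQ ID NO:2.
target = "ATGGCCCTGAAGTTCGACCTGAGCAAGGTGCTGACC"
probe = complementary_probe(target, 0)
print(probe)
```

Taking the reverse complement of the probe returns the original 30-nucleotide window, which is one simple way to check that the probe is completely complementary to its target region.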

The Examiner asserts that "there is no guidance provided in the specification as to what all 
nucleic acids encoding variant polypeptides can be used for - particularly those variants encoding
non-functional polypeptides" (Office Action, March 21, 2003; page 40; emphasis in original). To the
contrary. The nature of the biological functions, or lack thereof, of the recited polypeptides has no 
bearing on the ability of a skilled artisan to screen a cDNA library or use appropriate PCR conditions 
to identify relevant polynucleotides, and their encoded polypeptides, that already exist in nature, without
undue experimentation. It is irrelevant whether any of the claimed polynucleotides encode polypeptide 
variants which have biological functions other than phosphate transport, or any biological functions at 
all. One of skill in the art would still know how to make and use such polynucleotides, without undue 
experimentation. For example, polynucleotides which encode nonfunctional polypeptide variants of 
SEQ ID NO: 1 could be used to detect polynucleotides which encode the polypeptide of SEQ ID 
NO: 1 by, for example, hybridization and/or PCR techniques. It is not necessary for a polynucleotide to 
encode a functional polypeptide for one of skill in the art to be able to use that polynucleotide without 
undue experimentation. 

In addition, the Examiner has ignored the fact that the recited polynucleotide variants have 
specific, substantial, and credible utilities in, for example, toxicology testing in drug discovery (discussed 
above under Issue 1). One of skill in the art would know that, as a part of such toxicology testing, the 
recited polynucleotide variants could be used to detect toxic side effects of drug candidates targeted to 
other polynucleotides. Therefore, the claimed polynucleotides and arrays meet the enablement 
requirement of 35 U.S.C. § 112, first paragraph, based at least on the well-known, specific, and
substantial utilities of expressed, naturally occurring, polynucleotides in toxicology testing. 

The Examiner asserts that "the claimed modified nucleic acids are not limited to those encoding 
naturally-occurring polypeptides" (Office Action, March 21, 2003; page 40). Note that, in the claims 
at issue, the use of a minimum percent identity (e.g., at least 90% or 95% identity) to structurally define 
a sequence in relation to a reference sequence (e.g., SEQ ID NO: 1 or SEQ ID NO:2) is coupled with 
a recitation of the "naturally occurring" limitation. The recited "modified nucleic acids" which do not 
encode "naturally occurring" polypeptides, referred to by the Examiner, are polynucleotides encoding 
fragments of SEQ ID NO: 1 having phosphate transport activity (e.g., in claim 3), or fragments of a 
polynucleotide consisting of nucleotides 1183 through 1454 of SEQ ID NO:2 (e.g., in claims 13 and
58). One of skill in the art would know how to make such fragments, without undue experimentation, 
based on the disclosed sequences of SEQ ID NO: 1 and SEQ ID NO:2. Knowledge of SEQ ID NO: 1 
and SEQ ID NO:2 allows the skilled artisan to make every one of the recited fragments by, for 
example, chemical synthesis. Furthermore, one of skill in the art would know how to use such
fragments, for example, for the detection of the full-length SEQ ID NO:2 polynucleotide or 
polynucleotides encoding the full-length SEQ ID NO: 1 polypeptide, without undue experimentation. 
One of skill in the art would know how to make and use the recited fragments irrespective of whether 
they comprise "naturally occurring" sequences. 

With respect to the claimed fragments, the Examiner asserts that "the function of the nucleic 
acids comprising fragments of SEQ ID NO:2 is not limited to a hybridization probe and, from the 
specification, it appears that an intended use of the claimed nucleic acids is for protein expression" 
(Office Action, March 21, 2003; page 47). Even though the claimed polynucleotide fragments have 
uses other than as hybridization probes, they are nevertheless enabled based at least on their use as 
hybridization probes. The Examiner has provided no evidence that a skilled artisan would not 
reasonably believe that the claimed fragments could be used as hybridization probes. Therefore, a 
prima facie case for non-enablement has not been established with respect to the claimed 
polynucleotide fragments. 

The Examiner would require a precise knowledge of the biological functions of polypeptides 
encoded by the claimed polynucleotides in order to satisfy the enablement requirement of 35 U.S.C. § 
112, first paragraph. However, precise knowledge of biological function is not required for 
enablement. All that is necessary to satisfy the enablement requirement is that one of skill in the art 
would reasonably understand how to make and use the claimed invention. One of the ways in which 
Appellants have satisfied the enablement requirement is by showing that a skilled artisan would 
reasonably understand that the recited polypeptide variants could be used as phosphate transport 
proteins. Nevertheless, the Examiner argues at length that Appellants have not demonstrated to a 
certainty the biological functions of polypeptides encoded by the claimed polynucleotides. 

For example, the Examiner asserts that "Brenner (Trends in Genetics 15:132-133) teaches 
that it is impossible to know the accuracy of functional assignment of a protein based solely on nucleic 
acid sequence without empirical laboratory evidence" (Office Action, March 21, 2003; page 41). In 
support of this assertion, the Examiner cites Brenner et al. (Trends Genet., 1999, 15:132-133), Bork 
(Genome Res., 2000, 10:398-400), Broun et al. (Science, 1998, 282:1315-1317), Seffernick et al. (J.
Bacteriol., 2001, 183:2405-2410), Gerlt et al. (Genome Biol., 2000, 1:reviews0005.1-0005.10), and
Scott et al. (Nat. Genet., 1999, 21:440-443). The Examiner appears to again be arguing that functional 
predictions cannot support patentability unless they can be made with absolute accuracy. As discussed 
above under Issue 1 (e.g., at § III.D), although the cited references may show that functional 
predictions cannot be made with 100% accuracy, they nevertheless demonstrate that a skilled artisan 
would believe that such predictions would be reasonably accurate. All that is required to satisfy the 
enablement requirement of 35 U.S.C. § 112, first paragraph, is that one of skill in the art would
reasonably understand how to make and use the claimed invention. 

The Examiner has based the alleged lack of enablement on the mere possibility that mutations 
can sometimes eliminate the biological function of a naturally occurring polypeptide. This conclusion 
ignores the teachings of Brenner et al. (Proc. Natl. Acad. Sci. USA, 1998, 95:6073-6078; of record), 
which speaks to the general applicability of using sequence homology as low as 30% over 150 amino 
acid residues, and as low as 40% over 70 amino acid residues, to indicate protein homology, and Bork 
(cited by the Examiner), which teaches that the prediction of functional features by homology has a 
90% accuracy rate, and that the accuracy rate for all bioinformatics predictions is 70% (Table 1 of 
Bork). 

The Examiner criticizes the use of the Brenner et al. (1998) reference by stating that these 
authors "state that their comparisons 'have been assessed using proteins whose relationships are 
known reliably from their [three dimensional] structures and functions, as described in the 
SCOP database' ... In the instant case, the identity of the claimed variants is based solely on sequence 
identity - not on their three dimensional structures or their functions" (Office Action, March 21, 2003; 
page 40). However, the Examiner's criticisms are inapt. The use of the SCOP database of Murzin et 
al. (J. Mol. Biol., 1995, 247:536-540; cited by the Examiner) by Brenner et al. (1998) makes the 
general rules obtained by Brenner et al. (1998) more reliable. As the Examiner recognizes, "the 
proteins within the SCOP database have been fully characterized - functionally by empirical 
laboratory experiments and structurally by generating a three-dimensional structure of the proteins" 
(Office Action, March 21, 2003; page 20; emphasis in original). Since the reference database used by 
Brenner et al. (1998) contains only proteins which have been fully characterized, any general findings
are robust and reliable. 

The Examiner asserts that "the identity of the claimed variants is based solely on sequence 
identity - not on their three dimensional structures or their functions" (Office Action, March 21, 2003; 
page 40). However, the rules derived by Brenner et al. (1998) are ideal for addressing a situation such 
as this. Brenner et al. (1998) used the robust SCOP database to derive general rules for identifying 
homology between proteins, which could then be applied to any other proteins which had not been fully 
characterized. If it were necessary to have fully characterized a protein before it was analyzed using the 
standards derived by Brenner et al. (1998), then there would be no need for such analysis because the 
protein would already have been characterized! Thus, there is no need for an empirical determination 
of function or an actual determination of three-dimensional structure to use the general rules of Brenner 
et al. (1998).

Furthermore, the Examiner contends that "Brenner et al. clearly does not suggest that all amino 
acid sequences with at least 30% identity over 150 amino acids to another amino acid sequence will 
share a similar function. Instead, Brenner (Trends in Genetics 15:132-133) teaches that it is
impossible to know the accuracy of functional assignment of a protein based solely on nucleic acid 
sequence without empirical laboratory evidence" (Office Action, March 21, 2003; page 41). It may be 
impossible to know if a functional assignment of a protein based on sequence homology is absolutely 
correct, but it is not necessary to meet such a standard to satisfy the enablement requirement of 35 
U.S.C. § 112, first paragraph. All that is necessary is that a skilled artisan would reasonably know
how to make and use the claimed invention. 

The Examiner also criticizes the use of the Bork reference, stating that "the evidence provided is 
a crude estimate and that 'numbers in Table 1 are often overestimates because the test sets used are 
usually not representative of all sequences' " (Office Action, March 21, 2003; page 41). Even if the 
figures in Table 1 of Bork are "crude estimates" and "overestimates," they are nevertheless Bork's best 
estimates for judging the accuracy of bioinformatics analyses. Absent any other estimates, one of skill
in the art would accept the estimates of Bork as reasonable. In addition, even if the test sets used are 
usually not representative of all sequences, the data compiled by Bork are still the best test sets 
available in the literature, and are indicative of the state of the art. Moreover, in spite of the caveats
cited by the Examiner, Bork nevertheless states that "there is still no doubt that sequence analysis is 
extremely powerful" (Bork, 2000, page 400, second column, 2nd paragraph). Therefore, the Bork 
reference supports the notion that a skilled artisan would reasonably understand how to make and use 
the claimed polynucleotide variants and arrays. 

The Examiner cites Broun et al., Seffernick et al., and Gerlt et al. as evidence that "there is a 
high degree of unpredictability in making mutations with an expectation that the encoded polypeptide 
will maintain a similar activity" (Office Action, March 21, 2003; page 41). It may be true that there is
some degree of unpredictability in making mutations to a polypeptide. However, the quantitative 
criteria of Brenner et al. (1998) and Bork's estimates of accuracy demonstrate that it is reasonably 
likely that the polypeptide variants recited by the claims, which are at least 90% identical to SEQ ID 
NO: 1, would retain the functions of the SEQ ID NO: 1 polypeptide. The degree of unpredictability is
not so high as to preclude a skilled artisan from believing that there is a reasonable expectation that a 
mutant polypeptide at least 90% identical to a reference polypeptide would retain the activity of the 
reference polypeptide. 

The Examiner has stated that Broun et al. teach that "as few as four amino acid substitutions in a 
polypeptide having approximately 380 amino acids completely alters the enzymatic function of the 
polypeptide from a desaturase to a hydroxylase" (Office Action, September 20, 2002; page 10). 
Broun et al. disclose that "only four changes are required to convert a strict desaturase to an enzyme 
that retains some desaturase activity but is also an efficient hydroxylase" (Broun et al., page 1317, 
left column, 1st paragraph; emphasis added). Thus, the mutations do not completely alter the 
enzymatic function of the polypeptide, as asserted by the Examiner. The mutant polypeptide can still 
be used as a desaturase. Broun et al. also note that "a small number of amino acid substitutions will
account for the functional divergence of desaturases, hydroxylases, expoxgenases [sic], and acetylenic 
bond-forming enzymes" (Broun et al., page 1317, left column, third paragraph). This supports the 
notion that most amino acid substitutions have no effect or minimal effect on protein function. 

The Examiner has cited Seffernick et al. as evidence that "two polypeptides encoded by 
naturally-occurring polynucleotides, while sharing significant sequence homology, may have completely 
different functions" (Office Action, September 20, 2002; page 10). Seffernick et al. describe a
melamine deaminase and an atrazine chlorohydrolase that share 98% sequence identity and yet have 
different substrate specificities. These two enzymes both belong to the amidohydrolase enzyme 
superfamily whose members catalyze the hydrolytic displacement of amino groups or chlorine 
substituents from triazine ring compounds (e.g., Seffernick et al., page 2409, right column, second 
paragraph). Notably, there is at least one member of the amidohydrolase superfamily that catalyzes 
both deamination and dechlorination reactions with triazine ring substrates (Id.). Therefore, the 98% 
sequence homology between melamine deaminase and atrazine chlorohydrolase correctly predicts their 
functional similarity and their membership in a common enzyme family. 

This example in which it is difficult to obtain a precise functional prediction does not contradict 
the findings of Bork that, in the majority of cases, protein function is accurately predicted by sequence 
homology methods. In the Seffernick example, sequence homology methods correctly assign proteins 
to a particular enzyme family whose members share similar enzyme activities. Thus, Seffernick et al. do 
not contradict the evidence that one of skill in the art would reasonably conclude that the polypeptide 
variants encoded by the claimed polynucleotides could be used in the same manner as the SEQ ID 
NO: 1 polypeptide. The Examiner insists that the two proteins of Seffernick et al. "are not functionally 
similar . . . Each of the enzymes - while being 99% identical at the encoding nucleic acid level - exhibits 
a distinct function and neither uses the other's substrate" (Office Action, March 21, 2003; page 43). 
However, the fact that there is at least one member of the amidohydrolase superfamily that catalyzes 
both deamination and dechlorination reactions with triazine ring substrates (e.g., Seffernick et al., page 
2409, right column, second paragraph) demonstrates that these enzymes are functionally similar. The 
membership of the two proteins of Seffernick et al. in one enzyme family also demonstrates functional 
similarity. 

The Examiner cites Gerlt et al. as evidence that "homologous members of a superfamily may 
have diverse functions" (Office Action, March 21, 2003; page 43). Even though it may be possible that 
this is correct, this does not negate the fact that structural homology is a reasonably reliable indicator of 
functional homology. The Examiner also contends that Gerlt et al. "teach that 'even within homologous 
families of a single superfamily, the level of sequence similarity required for reliable prediction of 
function from sequence cannot be specified with confidence' " (Office Action, March 21, 2003; page 
43). This seems to be a reiteration of the Examiner's argument that estimates of accuracy in 
bioinformatics analyses, such as those taught by the Bork reference, are not reliable. However, even if 
the error estimates are not completely accurate, they still provide some sense of the degree to which 
bioinformatics analyses can be trusted. Bork compiled the bioinformatics analyses 
available in the literature and derived the best available estimates of the accuracy of these analyses. 
Even if Bork's data was compiled from test sets that "are usually not representative of all sequences," 
these test sets were nevertheless the best available, and are indicative of the state of the art at that time. 
Despite the teachings of Gerlt et al., a skilled artisan would understand that the estimates of Bork 
represent the state of the art in evaluating sequence analyses, and would consider such analyses to be 
reasonably reliable. 

The Examiner states that "Gerlt et al. further teach that their results 'illustrate that mechanistic 
diversity does not require a large significant divergence in sequence, and underscore that high levels of 
sequence identity do not 'guarantee' the same enzymatic function' . . . thus contradicting the teachings 
of Bork et al. and Brenner et al." (Office Action, March 21, 2003; page 43). The Examiner is incorrect 
in asserting that Gerlt et al. contradicts the teachings of Bork and Brenner et al. (1998). Bork and 
Brenner et al. (1998) do not teach that high levels of sequence identity guarantee the same enzymatic 
function. In fact, Bork and Brenner et al. (1998) acknowledge that there are errors in methods 
employing sequence homology. Instead, Bork and Brenner et al. (1998), like Gerlt et al., find that 
sequence homology is a good method that gives reasonably reliable results, even though those results 
are not always 100% correct. 

The Examiner's main assertion seems to be that the cited references "provide evidence for the 
high degree of unpredictability that the claimed variants could be used in the same manner as the NPT1 
phosphate transporter" (Office Action, March 21, 2003; page 43). Appellants do not dispute that 
there is some unpredictability involved in functional annotation based on sequence homology. 
However, contrary to the Examiner's assertions, the degree of unpredictability is not so high as to 
preclude a skilled artisan from reasonably understanding how to make and use the claimed invention. 
The Examiner has not shown that the degree of unpredictability is so high that one of skill in the art 
would reasonably doubt how to make and use the claimed invention. Instead, the Examiner insists that 
there must be no margin for error in functional annotation in order for the invention to meet the 
enablement requirement of 35 U.S.C. § 112, first paragraph. 

Seffernick et al. recognize that in "current genome annotation efforts . . . functional assignments 
based on >50% sequence identity are considered to be reasonably sound" (Seffernick et al., page 
2409, left column, paragraph 2). These authors state that their finding that "proteins with >98% 
sequence identity catalyze different reactions in different metabolic pathways is highly exceptional" 
(Id.; emphasis added). Thus, while there may be a number of examples in which the assignment of 
function by sequence homology is not perfectly accurate, these examples do not contradict the findings 
of Bork that, in general, sequence homology is an accurate method for assigning biological function. 

The Examiner states that "Seffernick et al. teach their result of identifying two proteins with 
>98% identity and having distinct functions 'underlies current genome annotation efforts where 
functional assignments based on >50% sequence identity are considered to reasonably sound' " (Office 
Action, March 21, 2003; page 43; emphasis added). This statement does not provide "additional 
support for the uncertainty in assigning function based on structural identity alone." By presenting 
evidence "underlying" the reasonable soundness of functional assignments based on >50% sequence 
identity, Seffernick et al. support the argument that sequence identity above a certain threshold is 
reasonably predictive of function. 

In further support of the rejection, the Examiner has cited Bork as evidence that "predicting the 
function of a polypeptide encoded by a specific gene, by sequence database searches has a 
considerable error rate" (Office Action, September 20, 2002; page 10). However, this does not 
negate the fact that there is a 90% accuracy rate for the prediction of functional features by homology, 
as disclosed by Bork. At most the Examiner shows that errors can occur in functional assignment. The 
Bork reference does not show that errors do not occur, but it does quantify the error rate at about 
10%. The Examiner asserts that "[o]ne of the references Bork relies on in establishing [the 90% 
accuracy rate for predicting functional features by homology] is that of Brenner (Trends in Genetics 
15: 132-133). Brenner teaches an error rate of 'at least 8% for the 340 genes annotated', provides 
evidence for why this error rate must be greater, and states, 'the true error rate must be greater than 
these figures indicate'." (Office Action, March 21, 2003; page 45). However, the assertion that the 
error rate is greater than 8% is consistent with Bork's estimate of a 10% error rate. Thus, Brenner 
(1998) supports Bork's finding of a 90% accuracy rate for the prediction of functional features by 
homology. 

The references cited by the Examiner show that there may be difficulties and errors involved in 
predicting protein function by homology, and that there is "unpredictability in assigning function based 
on sequence identity alone" (Office Action, March 21, 2003; page 45). However, these references do 
not contradict the fact that such methods are accurate more often than not. As such, one of skill in the 
art would reasonably conclude that the polypeptide variants encoded by the claimed polynucleotides 
possess the functions of the family of phosphate transporter proteins. 

The Examiner has failed to demonstrate that one of skill in the art could not make and use the 
claimed polynucleotides encoding polypeptide variants comprising naturally occurring amino acid 
sequences at least 90% identical to SEQ ID NO: 1. The Examiner has only provided isolated examples 
in which mutations can sometimes result in a shift of the biological activity of a naturally occurring 
polypeptide to a related biological activity found in other members of the polypeptide family. Although 
the Examiner insists that the cited references "are representative and are not 'isolated examples'," the 
Examiner has not presented any evidence that this is so. In contrast, the references of Brenner et al. 
(1998) and Bork present findings which are generally applicable to the accuracy and reliability of 
sequence analysis methods because these references have compiled the results of many such 
experiments. Furthermore, the references cited by the Examiner have no bearing on the ability of a 
skilled artisan to screen a cDNA library or use appropriate PCR conditions to identify relevant 
polynucleotides, and their encoded polypeptides, that already exist in nature, without undue 
experimentation. 

The Examiner argues that the claimed polynucleotides and arrays are not enabled because there 
is a lack of guidance as to how to use the entire scope of the claimed polynucleotides, and because the 
unpredictability of the art would prevent a skilled artisan from understanding how to make and use the 
claimed invention. The Examiner's arguments focus on the use of the claimed polynucleotides to 
express polypeptides having phosphate transport activity (Office Action, March 21, 2003; pages 
48-50). These arguments ignore the ability of a skilled artisan to make and use all of the claimed 
polynucleotides, including the recited variants and fragments, as hybridization probes or PCR probes to 
detect, for example, the presence of a polynucleotide comprising SEQ ID NO:2 (Specification, e.g., at 
page 33, lines 9-31; and Example VI at page 40). Furthermore, a skilled artisan could make and use 
all of the claimed arrays to detect, for example, the presence of a polynucleotide comprising SEQ ID 
NO:2 (e.g., at page 31, lines 6-26; and Example VI at page 40). A skilled artisan would reasonably 
understand how to use the claimed invention in at least these ways. The Examiner has failed to provide 
evidence or sound scientific reasoning to show otherwise. Therefore, the claimed invention meets the 
enablement requirement of 35 U.S.C. § 112, first paragraph. 

As set forth in In re Marzocchi, 169 USPQ 367, 369 (CCPA 1971): 

The first paragraph of § 112 requires nothing more than objective enablement. How 
such a teaching is set forth, either by the use of illustrative examples or by broad 
terminology, is of no importance. 

As a matter of Patent Office practice, then, a specification disclosure which contains a 
teaching of the manner and process of making and using the invention in terms which 
correspond in scope to those used in describing and defining the subject matter sought 
to be patented must be taken as in compliance with the enabling requirement of the first 
paragraph of § 112 unless there is reason to doubt the objective truth of the statements 
contained therein which must be relied on for enabling support. 

Contrary to the standard set forth in Marzocchi, the Examiner has failed to provide any 
reasons why one would doubt that the guidance provided by the present Specification would enable 
one to make and use the recited polynucleotides encoding polypeptide variants and fragments of SEQ 
ID NO:1, the recited polynucleotide variants and fragments of SEQ ID NO:2, or the recited arrays 
comprising nucleic acid molecules completely complementary to portions of the recited polynucleotides. 
Hence, a prima facie case for non-enablement has not been established with respect to the recited 
variants and fragments of SEQ ID NO: 1 and SEQ ID NO:2. 

For at least the above reasons, reversal of this rejection is requested. 

Issue 4 - Whether claims 3-7, 9, 10, 12, and 57 are unpatentable over claims 1-8 of U.S. 
Patent No. 5,985,604 

Claims 3-7, 9, 10, 12, and 57 stand rejected under the judicially created doctrine of 
obviousness-type double patenting as being unpatentable over claims 1-8 of U.S. Patent No. 
5,985,604 (the '604 patent). 

Appellants request that the requirement for submission of a Terminal Disclaimer with respect to 
the '604 patent be held in abeyance until such time that there is an indication of allowable subject 
matter. The Examiner has acknowledged this request (Office Action, March 21, 2003; page 51, § 13). 

(9) CONCLUSION 

Appellants respectfully submit that rejections for lack of utility based, inter alia, on an 
allegation of "lack of specificity," as set forth by the Examiner and as justified in the Revised Interim and 
final Utility Guidelines and Training Materials, are not supported in the law. Neither are they 
scientifically correct, nor supported by any evidence or sound scientific reasoning. As is disclosed in 
the specification, and even more clearly, as one of ordinary skill in the art would understand, the 
claimed invention has well-established, specific, substantial and credible utilities. The rejections are, 
therefore, improper and should be reversed. 

Moreover, to the extent the above rejections were based on the Revised Interim and final 
Examination Guidelines and Training Materials, those portions of the Guidelines and Training Materials 
that form the basis for the rejections should be determined to be inconsistent with the law. 

The written description rejections and enablement rejections should also be reversed, based on at 
least the arguments presented above. 

Due to the urgency of this matter, and its economic and public health implications, an expedited 
review of this appeal is earnestly solicited. 

If the USPTO determines that any additional fees are due, the Commissioner is hereby 
authorized to charge Deposit Account No. 09-0108. 

This brief is enclosed in triplicate. 

Respectfully submitted, 
INCYTE CORPORATION 

Date: 



Customer No.: 27904 
3160 Porter Drive 
Palo Alto, California 94304 
Phone: (650) 855-0555 
Fax: (650) 849-8886 




Terence P. Lo, Ph.D. 
Limited Recognition (37 C.F.R. § 10.9(b)) attached 
Direct Dial Telephone: (650) 621-8581 

APPENDIX 

Claims on appeal: 

3. An isolated polynucleotide encoding a polypeptide selected from the group consisting of: 

a) a polypeptide comprising the amino acid sequence of SEQ ID NO:1, 

b) a polypeptide comprising a naturally-occurring amino acid sequence at least 90% identical 
to the amino acid sequence of SEQ ID NO: 1, and 

c) a fragment of a polypeptide having the amino acid sequence of SEQ ID NO: 1, wherein said 
fragment transports phosphate. 

4. An isolated polynucleotide of claim 3, encoding a polypeptide comprising the amino acid 
sequence of SEQ ID NO: 1. 

5. An isolated polynucleotide of claim 3, comprising the polynucleotide sequence of SEQ ID 
NO:2. 

6. A recombinant polynucleotide comprising a promoter sequence operably linked to a 
polynucleotide of claim 3. 

7. A cell transformed with a recombinant polynucleotide of claim 6. 

9. A method of producing a polypeptide encoded by the polynucleotide of claim 3, the method 
comprising: 

a) culturing a cell under conditions suitable for expression of the polypeptide, wherein said cell 
is transformed with a recombinant polynucleotide, and said recombinant polynucleotide comprises a 
promoter sequence operably linked to the polynucleotide of claim 3, and 

b) recovering the polypeptide so expressed. 

10. The method of claim 9, wherein the polypeptide has the amino acid sequence of SEQ ID 
NO:1. 

12. An isolated polynucleotide selected from the group consisting of: 

a) a polynucleotide comprising the polynucleotide sequence of SEQ ID NO:2, 

b) a polynucleotide comprising a naturally-occurring polynucleotide sequence at least 90% 
identical to the polynucleotide sequence of SEQ ID NO:2, 

c) a polynucleotide completely complementary to a polynucleotide of a), 

d) a polynucleotide completely complementary to a polynucleotide of b), and 

e) an RNA equivalent of a)-d). 

13. An isolated polynucleotide comprising at least 20 contiguous nucleotides of a 
polynucleotide selected from the group consisting of: 

a) a polynucleotide consisting of nucleotides 1183 through 1454 of the polynucleotide 
sequence of SEQ ID NO:2, 

b) a polynucleotide consisting of a naturally-occurring polynucleotide sequence at least 90% 
identical to nucleotides 1183 through 1454 of the polynucleotide sequence of SEQ ID NO:2, 

c) a polynucleotide completely complementary to a polynucleotide of a), 

d) a polynucleotide completely complementary to a polynucleotide of b), and 

e) an RNA equivalent of a)-d). 

46. A microarray wherein at least one element of the microarray is a polynucleotide of claim 
13. 

48. An array comprising different nucleic acid molecules affixed in distinct physical locations on 
a solid substrate, wherein at least one of said nucleic acid molecules comprises a first oligonucleotide or 
polynucleotide sequence completely complementary to at least 30 contiguous nucleotides of a target 
polynucleotide, and wherein said target polynucleotide is a polynucleotide of claim 12. 

57. A polynucleotide of claim 12, selected from the group consisting of: 

a) a polynucleotide comprising the polynucleotide sequence of SEQ ID NO:2, 

b) a polynucleotide comprising a naturally-occurring polynucleotide sequence at least 95% 
identical to the polynucleotide sequence of SEQ ID NO:2, 

c) a polynucleotide completely complementary to a polynucleotide of a), 

d) a polynucleotide completely complementary to a polynucleotide of b), and 

e) an RNA equivalent of a)-d). 

58. An isolated polynucleotide of claim 13, comprising at least 60 contiguous nucleotides of a 
polynucleotide selected from the group consisting of: 

a) a polynucleotide consisting of nucleotides 1183 through 1454 of the polynucleotide 
sequence of SEQ ID NO:2, 

b) a polynucleotide consisting of a naturally-occurring polynucleotide sequence at least 90% 
identical to nucleotides 1183 through 1454 of the polynucleotide sequence of SEQ ID NO:2, 

c) a polynucleotide completely complementary to a polynucleotide of a), 

d) a polynucleotide completely complementary to a polynucleotide of b), and 

e) an RNA equivalent of a)-d). 

1. An important feature of the work of many molecular biologists is identifying which 
genes are switched on and off in a cell under different environmental conditions or 
subsequent to xenobiotic challenge. Such information has many uses, including the 
deciphering of molecular pathways and facilitating the development of new experimental 
and diagnostic procedures. However, the student of gene hunting should be forgiven for 
perhaps becoming confused by the mountain of information available as there appears to be 
almost as many methods of discovering differentially expressed genes as there are research 
groups using the technique. 

2. The aim of this review was to clarify the main methods of differential gene expression 
analysis and the mechanistic principles underlying them. Also included is a discussion on 
some of the practical aspects of using this technique. Emphasis is placed on the so-called 
'open' systems, which require no prior knowledge of the genes contained within the study 
model. Whilst these will eventually be replaced by 'closed' systems in the study of human, 
mouse and other commonly studied laboratory animals, they will remain a powerful tool for 
those examining less fashionable models. 

3. The use of suppression-PCR subtractive hybridization is exemplified in the 
identification of up- and down-regulated genes in rat liver following exposure to 
phenobarbital, a well-known inducer of the drug metabolizing enzymes. 

4. Differential gene display provides a coherent platform for building libraries and 
microchip arrays of 'gene fingerprints' characteristic of known enzyme inducers and 
xenobiotic toxicants, which may be interrogated subsequently for the identification and 
characterization of xenobiotics of unknown biological properties. 

Introduction 

It is now apparent that the development of almost all cancers and many 
non-neoplastic diseases is accompanied by altered gene expression in the affected cells 
compared to their normal state (Hunter 1991, Wynford-Thomas 1991, Vogelstein 
and Kinzler 1993, Semenza 1994, Cassidy 1995, Kleinjan and Van Heyningen 1998). 
Such changes also occur in response to external stimuli such as pathogenic 
micro-organisms (Rohn et al. 1996, Singh et al. 1997, Griffin and Krishna 1998, Lunney 
1998) and xenobiotics (Sewall et al. 1995, Dogra et al. 1998, Ramana and Kohli 
1998), as well as during the development of undifferentiated cells (Hecht 1998, 
Rudin and Thompson 1998, Schneider-Maunoury et al. 1998). The potential 
medical and therapeutic benefits of understanding the molecular changes which 
occur in any given cell in progressing from the normal to the 'altered' state are 
enormous. Such profiling essentially provides a 'fingerprint' of each step of a 
cell's development or response and should help in the elucidation of specific and 
sensitive biomarkers representing, for example, different types of cancer or previous 
exposure to certain classes of chemicals that are enzyme inducers. 

In drug metabolism, many of the xenobiotic-metabolizing enzymes (including 
the well-characterized isoforms of cytochrome P450) are inducible by drugs and 
chemicals in man (Pelkonen et al. 1998), predominantly involving transcriptional 
activation of not only the cognate cytochrome P450 genes, but additional cellular 
proteins which may be crucial to the phenomenon of induction. Accordingly, the 
development of methodology to identify and assess the full complement of genes 
that are either up- or down-regulated by inducers is crucial in the development of 
knowledge to understand the precise molecular mechanisms of enzyme induction 
and how this relates to drug action. Similarly, in the field of chemical-induced 
toxicity, it is now becoming increasingly obvious that most adverse reactions to 
drugs and chemicals are the result of multiple gene regulation, some of which are 
causal and some of which are casually-related to the toxicological phenomenon per 
se. This observation has led to an upsurge in interest in gene-profiling technologies 
which differentiate between the control and toxin-treated gene pools in target tissues 
and are, therefore, of value in rationalizing the molecular mechanisms of 
xenobiotic-induced toxicity. Knowledge of toxin-dependent gene regulation in target tissues is 
not solely an academic pursuit, as much interest has been generated in the 
pharmaceutical industry to harness this technology in the early identification of toxic 
drug candidates, thereby shortening the developmental process and contributing 
substantially to the safety assessment of new drugs. For example, if the gene profile 
in response to, say, a testicular toxin that has been well-characterized in vivo could be 
determined in the testis, then this profile would be representative of all new drug 
candidates which act via this specific molecular mechanism of toxicity, thereby 
providing a useful and coherent approach to the early detection of such toxicants. 
Whereas it would be informative to know the identity and functionality of all genes 
up/down regulated by such toxicants, this would appear a longer term goal, as the 
majority of human genes have not yet been sequenced, far less their functionality 
determined. However, the current use of gene profiling yields a pattern of gene 
changes for a xenobiotic of unknown toxicity which may be matched to that of 
well-characterized toxins, thus alerting the toxicologist to possible in vivo similarities 
between the unknown and the standard, thereby providing a platform for more 
extensive toxicological examination. Such approaches are beginning to gain 
momentum, in that several biotechnology companies are commercially producing 
'gene chips' or 'gene arrays' that may be interrogated for toxicity assessment of 
xenobiotics. These chips consist of hundreds/thousands of genes, some of which are 
degenerate in the sense that not all of the genes are mechanistically-related to any 
one toxicological phenomenon. Whereas these chips are useful in broad-spectrum 
screening, they are maturing at a substantial rate, in that gene arrays are now 
becoming more specific, e.g. chips for the identification of changes in growth factor 
families that contribute to the aetiology and development of chemically-induced 
neoplasias. 
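
The profile-matching idea described in this paragraph (comparing the pattern of gene changes for an unknown xenobiotic with the patterns of well-characterized toxins) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the gene names, the fold-change values, the reference toxin names, and the choice of cosine similarity as the matching score are hypothetical and are not taken from the review.

```python
from math import sqrt

def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two expression profiles.

    Each profile maps a gene name to a signed change in expression
    (positive = up-regulated, negative = down-regulated).
    Genes missing from a profile are treated as unchanged (0.0).
    """
    genes = set(profile_a) | set(profile_b)
    dot = sum(profile_a.get(g, 0.0) * profile_b.get(g, 0.0) for g in genes)
    norm_a = sqrt(sum(v * v for v in profile_a.values()))
    norm_b = sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_reference_profiles(unknown, references):
    """Return (reference name, score) pairs, best match first."""
    scores = {name: cosine_similarity(unknown, profile)
              for name, profile in references.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical reference fingerprints for two well-characterized toxins.
references = {
    "toxin_A": {"gene1": 2.0, "gene2": -1.5, "gene3": 0.0},
    "toxin_B": {"gene1": -1.0, "gene2": 2.0, "gene3": 1.0},
}
# Hypothetical fingerprint of a xenobiotic of unknown toxicity.
unknown = {"gene1": 1.8, "gene2": -1.2, "gene3": 0.1}

ranking = rank_reference_profiles(unknown, references)
best_match = ranking[0][0]
print(ranking)
```

A high score only flags a possible similarity between the unknown and the standard; as the text notes, it is a starting point for more extensive toxicological examination, not a conclusion.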

Although documenting and explaining these genetic changes presents a 
formidable obstacle to understanding the different mechanisms of development and 
disease progression, the technology is now available to begin attempting this difficult 
challenge. Indeed, several 'differential expression analysis' methods have been 
developed which facilitate the identification of gene products that demonstrate 
altered expression in cells of one population compared to another. These methods 
have been used to identify differential gene expression in many situations, including 
invading pathogenic microbes (Zhao et al. 1998), in cells responding to extracellular 
and intracellular microbial invasion (Duguid and Dinauer 1990, Ragno et al. 1997, 
Maldarelli et al. 1998), in chemically treated cells (Syed et al. 1997, Rockett et al. 
1999), neoplastic cells (Liang et al. 1992, Chang and Terzaghi-Howe 1998), 
[illegible] (Gurskaya et al. 1996, Wan et al. 1996), differentiated cells (Hara et 
al. 1991, Guimaraes et al. 1995a, b), and different cell types (Davis et al. 1984, 
Hedrick et al. 1984, Zhu et al. 1998). Although differential expression analysis 
technologies are applicable to a broad range of models, perhaps their most important 
advantage is that, in most cases, absolutely no prior knowledge of the specific genes 
which are up- or down-regulated is required. 

The field of differential expression analysis is a large and complex one with 
many techniques available to the potential user. These can be categorized into 
several methodological approaches, including: 

(1) Differential screening, 

(2) Subtractive hybridization (SH) (includes methods such as chemical 
cross-linking subtraction [CCLS], suppression-PCR subtractive hybridization 
[SSH], and representational difference analysis [RDA]), 

(3) Differential display (DD), 

(4) Restriction endonuclease facilitated analysis (including serial analysis of gene 
expression [SAGE] and gene expression fingerprinting [GEF]), 

(5) Gene expression arrays, and 

(6) Expressed sequence tag (EST) analysis. 

The above approaches have been used successfully to isolate differentially 
expressed genes in different model systems. However, each method has its own 
subtle (and sometimes not so subtle) characteristics which incur various advantages 
and disadvantages. Accordingly, it is the purpose of this review to clarify the 
mechanistic principles underlying the main differential expression methods and to 
highlight some of the broader considerations and implications of this very powerful 
and increasingly popular technique. Specifically, we will concentrate on the 
so-called 'open' systems, namely those which do not require any knowledge of gene 
sequences and, therefore, are useful for isolating unknown genes. Two 'closed' 
systems (those utilising previously identified gene sequences), EST analysis and the 
use of DNA arrays, will also be considered briefly for completeness. Whilst 
emphasis will often be placed on suppression PCR subtractive hybridization (SSH, 
the approach employed in this laboratory), it is the aim of the authors to highlight, 
wherever possible, those areas of common interest to those who use, or intend to use, 
differential gene expression analysis. 

Differential cDNA library screening (DS) 

Despite the development of multiple technological advances which have recently 
brought the field of gene expression profiling to the forefront of molecular analysis, 
recognition of the importance of differential gene expression and characterization of 
differentially expressed genes has existed for many years. One of the original 
approaches used to identify such genes was described 20 years ago by St John and 
Davis (1979). These authors developed a method, termed 'differential plaque filter 
hybridization', which was used to isolate galactose-inducible DNA sequences from 
yeast. The theory is simple: a genomic DNA library is prepared from normal, 
unstimulated cells of the test organism/tissue and multiple filter replicas are 
prepared. These replica blots are probed with radioactively (or otherwise) labelled 
complex cDNA probes prepared from the control and test cell mRNA populations. 
Those mRNAs which are differentially expressed in the treated cell population will 
show a positive signal only on the filter probed with cDNA from the treated cells. 
Furthermore, labelled cDNA from different test conditions can be used to probe 
multiple blots, thereby enabling the identification of mRNAs which are only 
up-regulated under certain conditions. For example, St John and Davis (1979) screened 
replica filters with acetate-, glucose- and galactose-derived probes in order to obtain 
genes induced specifically by galactose metabolism. Although groundbreaking in its 
time, this method is now considered insensitive and time-consuming, as up to 2 
months are required to complete the identification of genes which are differentially 
expressed in the test population. In addition, there is no convenient way to check 
that the procedure has worked until the whole process has been completed. 
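
The selection logic of differential screening, as described above, amounts to keeping the clones that give a positive signal on exactly one replica filter. The following Python sketch is a simplified illustration only: the clone names, signal values and the threshold used to call a signal positive are hypothetical, and a real experiment would involve judging autoradiograph spots rather than reading tidy numbers.

```python
def condition_specific_clones(signals, target, threshold=1.0):
    """Return clones that are positive only on the filter probed with
    cDNA from the target condition.

    signals: clone name -> {condition name: hybridization signal}
    threshold: assumed cut-off above which a signal counts as positive.
    """
    selected = []
    for clone, by_condition in signals.items():
        positive = {cond for cond, value in by_condition.items()
                    if value >= threshold}
        if positive == {target}:
            selected.append(clone)
    return sorted(selected)

# Hypothetical signals from three replica filters of the same library,
# each probed with cDNA derived from cells grown on a different carbon source.
signals = {
    "clone_01": {"acetate": 0.1, "glucose": 0.2, "galactose": 5.3},
    "clone_02": {"acetate": 4.0, "glucose": 3.8, "galactose": 4.1},
    "clone_03": {"acetate": 0.0, "glucose": 0.1, "galactose": 0.2},
    "clone_04": {"acetate": 2.2, "glucose": 0.1, "galactose": 3.0},
}

galactose_specific = condition_specific_clones(signals, "galactose")
print(galactose_specific)
```

In this toy data set only clone_01 is retained: clone_02 is expressed under every condition, clone_03 under none, and clone_04 also responds to acetate.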

Subtractive Hybridization (SH) 

The developing concept of differential gene expression and the success of early 
approaches such as that described by St John and Davis (1979) soon gave rise to a 
search for more convenient methods of analysis. One of the first to be developed was 
SH, numerous variations of which have since been reported (see below). In general, 
this approach involves hybridization of mRNA/cDNA from one population (tester) 
to excess mRNA/cDNA from another (driver), followed by separation of the 
unhybridized tester fraction (differentially expressed) from the hybridized common 
sequences. This step has been achieved physically, chemically and through the use 
of selective polymerase chain reaction (PCR) techniques. 
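
As a rough illustration of the general principle just described, subtraction can be pictured as removing from the tester pool every sequence for which the excess driver supplies enough partner molecules. The Python sketch below is a deliberately crude stoichiometric cartoon and not a model of hybridization kinetics; the sequence names, copy numbers and the tenfold driver excess are all hypothetical.

```python
def subtract(tester, driver, driver_excess=10):
    """Return the tester sequences left unhybridized after subtraction.

    tester, driver: sequence name -> relative copy number in each pool.
    driver_excess: assumed fold excess of driver mixed with the tester.
    A tester copy is treated as removed whenever a driver copy of the
    same sequence is available to pair with it.
    """
    remaining = {}
    for sequence, tester_copies in tester.items():
        driver_copies = driver.get(sequence, 0) * driver_excess
        left_over = tester_copies - driver_copies
        if left_over > 0:
            remaining[sequence] = left_over
    return remaining

# Hypothetical pools: two sequences common to both populations and one
# that is strongly up-regulated in the tester (treated) population.
tester = {"common_gene_1": 100, "common_gene_2": 80, "induced_gene": 40}
driver = {"common_gene_1": 100, "common_gene_2": 75, "induced_gene": 1}

enriched = subtract(tester, driver)
print(enriched)
```

With these numbers only the induced sequence survives, which mirrors the statement that the unhybridized tester fraction represents the differentially expressed species.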

Physical separation 

Original subtractive hybridization technology involved the physical separation 
of hybridized common species from unique single stranded species. Several methods 
of achieving this have been described, including hydroxyapatite chromatography 
(Sargent and Dawid 1983), avidin-biotin technology (Duguid and Dinauer 1990) 
and oligodT-latex separation (Hara et al. 1991). In the first approach, common 
mRNA species are removed by cDNA (from test cells)-mRNA (from control cells) 
subtractive hybridization followed by hydroxyapatite chromatography, as 
hydroxyapatite specifically adsorbs the cDNA-mRNA hybrids. The unadsorbed cDNA is 
then used either for the construction of a cDNA library of differentially expressed 
genes (Sargent and Dawid 1983, Schneider et al. 1988) or directly as a probe to 
screen a preselected library (Zimmerman et al. 1980, Davis et al. 1984, Hedrick et al. 
1984). A schematic diagram of the procedure is shown in figure 1. 

Less rigorous physical separation procedures coupled with sensitivity enhancing 
PCR steps were later developed as a means to overcome some of the problems 
encountered with the hydroxyapatite procedure. For example, Duguid and Dinauer
(1990) described a method of subtraction utilizing biotin-affinity systems as a means
to remove hybridized common sequences. In this process, both the control and
tester mRNA populations are first converted to cDNA and an adaptor ('oligovector',

Figure 1. The hydroxyapatite method of subtractive hybridization. cDNA derived from the
treated/altered (tester) population is mixed with a large excess of mRNA from the control (driver)
population. Following hybridization, mRNA-cDNA hybrids are removed by hydroxyapatite
chromatography. The only cDNAs which remain are those which are differentially expressed in
the treated/altered population. In order to facilitate the recovery of full length clones, small cDNA
fragments are removed by exclusion chromatography. The remaining cDNAs are then cloned into
a vector, or labelled and used directly to probe a library, as described by Sargent and Dawid (1983).

containing a restriction site) ligated to both sides. Both populations are then
amplified by PCR, but the driver cDNA population is subsequently digested with
the restriction endonuclease whose site is contained in the adaptor. This serves to
cleave the oligovector and reduce the amplification potential of the control population.
The digested control population is then biotinylated and an excess mixed with tester cDNA.
Following denaturation and hybridization, the mix is applied to a biocytin column
(streptavidin may also be used) to remove the control population, including
heteroduplexes formed by annealing of common sequences from the tester
population. The procedure is repeated several times following the addition of fresh
Figure 2. The use of oligodT-latex to perform subtractive hybridization. mRNA extracted from the
control (driver) population is converted to anchored cDNA using polydT oligonucleotides 
attached to latex beads. mRNA from the treated/altered (tester) population is repeatedly 
hybridized against an excess of the anchored driver cDNA. The final population of mRNA is 
tester specific and can be converted into cDNA for cloning and other downstream applications, as
described by Hara et al. (1991).
control cDNA. In order to further enrich those species differentially expressed in
the tester cDNA, the subtracted tester population is amplified by PCR following 
every second subtraction cycle. After six cycles of subtraction (three reamplification 
steps) the reaction mix is ligated into a vector for further analysis. 

In a slightly different approach, Hara et al. (1991) utilized a method whereby
oligo(dT30) primers attached to a latex substrate are used to first capture mRNA
extracted from the control population. Following 1st strand cDNA synthesis, the
RNA strand of the heteroduplexes is removed by heat denaturation and centrifugation
(the cDNA-oligotex-dT30 forms a pellet and the supernatant is removed).
A quantity of tester mRNA is then repeatedly hybridized to the immobilized control
(driver) cDNA (which is present in 20-fold excess). After several rounds of
hybridization the only mRNA molecules left in the tester mRNA population are
those which are not found in the driver cDNA-oligotex-dT30 population. These
tester-specific mRNA species are then converted to cDNA and, following the
addition of adaptor sequences, amplified by PCR. The PCR products are then
ligated into a vector for further analysis using restriction sites incorporated into the
PCR primers. A schematic illustration of this subtraction process is shown in figure 2.

However, all these methods utilising physical separation have been described as 
inefficient due to the requirement for large starting amounts of mRNA, significant 
loss of material during the separation process and a need for several rounds of 
hybridization. Hence, new methods of differential expression analysis have recently 
been designed to eliminate these problems. 
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Chemical Cross-Linking Subtraction (CCLS) 

In this technique, originally described by Hampson et al. (1992), driver mRNA
is mixed with tester cDNA (1st strand only) in a ratio of >20:1. The common
sequences form cDNA:mRNA hybrids, leaving the tester specific species as single
stranded cDNA. Instead of physically separating these hybrids, they are inactivated
chemically using 2,5-diaziridinyl-1,4-benzoquinone (DZQ). Labelled probes are
then synthesized from the remaining single stranded cDNA species (unreacted
mRNA species remaining from the driver are not converted into probe material due
to the specificity of the Sequenase T7 DNA polymerase used to make the probe) and used
to screen a cDNA library made from the tester cell population. A schematic diagram
of the system is shown in figure 3.

It has been shown that the differentially expressed sequences can be enriched at 
least 300-fold with one round of subtraction (Hampson et al. 1992), and that the 
technique should allow isolation of cDNAs derived from transcripts that are present 
at less than 50 copies per cell. This equates to genes at the low end of intermediate 
abundance (see table 1). The main advantages of the CCLS approach are that it is 
rapid, technically simple and also produces fewer false positives than other 
differential expression analysis methods. However, like the physical separation 
protocols, a major drawback with CCLS is the large amount of starting material 
required (at least 10 µg RNA). Consequently, the technique has recently been
refined so that a renewable source of RNA can be generated. The degenerate random
oligonucleotide primed (DROP) adaptation (Hampson et al. 1996, Hampson and
Hampson 1997) uses random hexanucleotide sequences to prime solid phase-synthesized
cDNA. Since each primer includes a T7 polymerase promoter sequence
Figure 3. Chemical cross-linking subtraction. Excess driver mRNA is mixed with 1st strand tester
cDNA. The common sequences form mRNA:cDNA hybrids which are cross-linked with
2,5-diaziridinyl-1,4-benzoquinone (DZQ) and the remaining cDNA sequences are differentially
expressed in the tester population. Probes are made from these sequences using Sequenase 2.0
DNA polymerase, which lacks reverse transcriptase activity and, therefore, does not react with the
remaining mRNA molecules from the driver. The labelled probes are then used to screen a cDNA
library for clones of differentially expressed sequences. Adapted from Walter et al. (1996), with
permission.


Table 1. The abundance of mRNA species and classes in a typical mammalian cell.

mRNA class     Copies of each   No. of mRNA species   Mean % of each     Mean mass (ng) of each
               species/cell     in class              species in class   species/µg total RNA

Abundant       12000            4                     3.3                1.65
Intermediate   300              500                   0.08               0.04
Rare           15               11000                 0.004              0.002

Modified from Bertioli et al. (1995).
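
The columns of table 1 are related by simple arithmetic, which the following sketch reproduces as a consistency check. Two assumptions are made that are not stated in the text: that all transcripts have the same average length (so that the mass fraction equals the number fraction), and that mRNA makes up 5% of total RNA (50 ng per µg). Under these assumptions the calculated values agree with the table to within rounding.

```python
# Values from table 1: class -> (copies of each species per cell, number of species)
classes = {
    "abundant": (12000, 4),
    "intermediate": (300, 500),
    "rare": (15, 11000),
}

# ASSUMPTION (not stated in the text): mRNA is 5% of total RNA,
# i.e. 50 ng of mRNA per microgram of total RNA.
MRNA_NG_PER_UG_TOTAL_RNA = 50.0

# Total number of mRNA molecules per cell, summed over all classes.
total_molecules = sum(copies * n_species for copies, n_species in classes.values())


def percent_per_species(copies):
    """Share (in %) of the mRNA pool contributed by one species."""
    return 100.0 * copies / total_molecules


def mass_ng_per_species(copies):
    """Mass (ng) of one species per microgram of total RNA, assuming equal transcript lengths."""
    return MRNA_NG_PER_UG_TOTAL_RNA * copies / total_molecules


for name, (copies, n_species) in classes.items():
    print(name, round(percent_per_species(copies), 3), round(mass_ng_per_species(copies), 3))
```

On the same scale, a transcript present at 50 copies per cell (the sensitivity quoted for CCLS) falls between the rare and intermediate classes.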
at the 5' end, the final pool of random cDNA fragments is a PCR-renewable cDNA
population which is representative of the expressed gene pool and can be used to 
synthesize sense RNA for use as driver material. Furthermore, if the final pool of 
random cDNA fragments is reamplified using biotinylated T7 primer and random 
hexamer, the product can be captured with streptavidin beads and the antisense 
strand eluted for use as tester. Since both target and driver can be generated from 
the same DROP product, subtraction can be performed in both directions (i.e. for 
up- and down-regulated species) between two different DROP products. 

Representational Difference Analysis (RDA) 

RDA of cDNA (Hubank and Schatz 1994) is an extension of the technique 
originally applied to genomic DNA as a means of identifying differences between 
two complex genomes (Lisitsyn et al. 1993). It is a process of subtraction and 
amplification involving subtractive hybridization of the tester in the presence of
excess driver. Sequences in the tester that have homologues in the driver are 
rendered unamplifiable, whereas those genes expressed only in the tester retain the 
ability to be amplified by PCR. The procedure is shown schematically in figure 4. 

In essence, the driver and tester mRNA populations are first converted to cDNA
and amplified by PCR following the ligation of an adaptor. The adaptors are then
removed from both populations and a new (different) adaptor ligated to the
amplified tester population only. Driver and tester populations are next melted and
hybridized together in a ratio of 100:1. Following hybridization, only tester:tester
homohybrids have 5' adaptors at each end of the DNA duplex and can, thus, be filled
in at both 3' ends. Hence, only these molecules are amplified exponentially during
the subsequent PCR step. Although tester:driver heterohybrids are present, they
only amplify in a linear fashion, since the strand derived from the driver has no
adaptor to which the primer can bind. Driver:driver homohybrids have no
adaptors and, therefore, are not amplified. Single stranded molecules are digested
with mung bean nuclease before a further PCR-enrichment of the tester:tester
homohybrids. The adaptors on the amplified tester population are then replaced and
the whole process repeated a further two or three times using an increasing excess of
driver (Hubank and Schatz used a tester:driver ratio of 1:400, 1:80000 and
1:800000 for the second, third and fourth hybridizations, respectively). Different
adaptors are ligated to the tester between successive rounds of hybridization and
amplification to prevent the accumulation of PCR products that might interfere with
subsequent amplifications. The final display is a series of differentially expressed
gene products easily observable on an ethidium bromide gel.
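
The selective step in RDA rests on the difference between exponential and linear amplification. The sketch below is an idealized illustration of that difference only, assuming perfectly efficient PCR cycles; the function name and the simple "1 + cycles" model for linear amplification are assumptions made for illustration and are not taken from Hubank and Schatz (1994).

```python
def relative_yield(hybrid_type, cycles):
    """Idealized product per starting duplex after a number of PCR cycles.

    tester:tester -> adaptor at both ends, doubles every cycle (exponential)
    tester:driver -> adaptor on one strand only, one new copy per cycle (linear)
    driver:driver -> no adaptor, no primer binding site, not amplified
    """
    if hybrid_type == "tester:tester":
        return 2 ** cycles
    if hybrid_type == "tester:driver":
        return 1 + cycles
    if hybrid_type == "driver:driver":
        return 1
    raise ValueError(f"unknown hybrid type: {hybrid_type}")


for cycles in (10, 20):
    print(
        cycles,
        relative_yield("tester:tester", cycles),
        relative_yield("tester:driver", cycles),
        relative_yield("driver:driver", cycles),
    )
```

Even though driver is in 100-fold or greater excess during hybridization, a per-duplex advantage of this size means that tester:tester homohybrids dominate the product after a modest number of cycles.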

The main advantages of RDA are that it offers a reproducible and sensitive
approach to the analysis of differentially expressed genes. Hubank and Schatz (1994)
reported that they were able to isolate genes that were differentially expressed in
substantially less than 1% of the cells from which the tester is derived. Perhaps the
main drawback is that multiple rounds of ligation, hybridization, amplification and
digestion are required. The procedure is, therefore, lengthier than many other
differential display approaches and provides more opportunity for operator-induced
error to occur. Although the generation of false positives has been noted, this has
been solved to some degree by O'Neill and Sinclair (1997) through the use of HPLC-
purified adaptors. These are free of the truncated adaptors which appear to be a
major source of the false positive bands. A very similar technique to RDA, termed
linker capture subtraction (LCS), was described by Yang and Sytowski (1996).
Figure 4. The representational difference analysis (RDA) technique. Driver and tester cDNA are
digested with a 4-cutter restriction enzyme such as DpnII. The 1st set of 12/24 adaptor strands
(oligonucleotides) are ligated to each other and the digested cDNA products. The 12mer is
subsequently melted away and the 3' ends filled in using Taq DNA polymerase. Each cDNA
population is then amplified using PCR, following which the 1st set of adaptors is removed with
DpnII. A second set of 12/24 adaptor strands is then added to the amplified tester cDNA
population, after which the tester is hybridized against a large excess of driver. The 12mer
adaptors are melted and the 3' ends filled in as before. PCR is carried out with primers identical
to the new 24mer adaptor. Thus, the only hybridization products which are exponentially
amplified are those which are tester:tester combinations. Following PCR, ssDNA products are
removed with mung bean nuclease, leaving the 'first difference product'. This is digested and a
third set of 12/24 adaptors added before repeating the subtraction process from the hybridization
stage. The process is repeated to the 3rd or 4th difference product, as described by Lisitsyn et al.
(1993) and Hubank and Schatz (1994).
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Suppression PCR Subtractive Hybridization (SSH) 

The most recent adaptation of the SH approach to differential expression
analysis was first described by Diatchenko et al. (1996) and Gurskaya et al. (1996).
They reported that a 1000-5000 fold enrichment of rare cDNAs (equivalent to
isolating mRNAs present at only a few copies per cell) can be obtained without the
need for multiple hybridizations/subtractions. Instead of physical or chemical
removal of the common sequences, a PCR-based suppression system is used (see
figure 5).

In SSH, excess driver cDNA is added to two portions of the tester cDNA which
have been ligated with different adaptors. A first round of hybridization serves to
enrich differentially expressed genes and equalize rare and abundant messages.
Equalization occurs since reannealing is more rapid for abundant molecules than for
rarer molecules due to the second order kinetics of hybridization (James and Higgins 
1 98d). The two primary hybridization mixes are then mixed together in the presence 
of excess driver and allowed to hybridize further. This step permits the annealing of 
single stranded complementary sequences which did not hybridize in the primary
hybridization, and in doing so generates templates for PCR amplification. Although
there are several possible combinations of the single stranded molecules present in
the secondary hybridization mix, only one particular combination (differentially
expressed in the tester cDNA, composed of complementary strands having different
adaptors) can amplify exponentially.
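
The equalization effect follows from second-order reannealing kinetics and can be shown with a short calculation. The sketch below is a simplification: it treats each species as reannealing only with its own complement (it ignores the excess driver), and the rate constant, time and starting concentrations are arbitrary illustrative numbers rather than measured values.

```python
def single_stranded_remaining(c0, k, t):
    """Second-order reannealing: dC/dt = -k * C**2, so C(t) = C0 / (1 + k * C0 * t)."""
    return c0 / (1.0 + k * c0 * t)


# Arbitrary starting concentrations of an abundant and a rare species.
abundant0, rare0 = 1000.0, 1.0
# Arbitrary (hypothetical) rate constant and hybridization time.
k, t = 0.01, 1.0

abundant = single_stranded_remaining(abundant0, k, t)
rare = single_stranded_remaining(rare0, k, t)

ratio_before = abundant0 / rare0
ratio_after = abundant / rare
print(ratio_before, round(ratio_after, 1))
```

Because the abundant species reanneals much faster, the ratio of abundant to rare molecules in the single stranded fraction falls during hybridization, which is the sense in which the first round "equalizes" the messages.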

Having obtained the final differential display, two options are available if cloning
of cDNAs is desired. One is to transform the whole of the final PCR reaction into
competent cells. Transformed colonies can then be isolated and their inserts
characterized by sequencing, restriction analysis or PCR. Alternatively, the final
PCR products can be resolved on a gel and the individual bands excised, reamplified
and cloned. The first approach is technically simpler and less time consuming.
However, ligation/transformation reactions are known to be biased towards the
cloning of smaller molecules, and so the final population of clones will probably not
contain a representative selection of the larger products. In addition, although
equalization theoretically occurs, observations in this laboratory suggest that this is
by no means perfectly accomplished. Consequently, some gene species are present
in a higher number than others and this will be represented in the final population
of clones. Thus, in order to obtain a substantial proportion of those gene species that
actually demonstrate differential expression in the tester population, the number of
clones that will have to be screened after this step may be substantial. The second
approach is initially more time consuming and technically demanding. However, it
would appear to offer better prospects for cloning larger and low abundance gel
products. In addition, one can incorporate a screening step that differentiates
different products of different sequences but of the same size (HA-staining, see
later). In this way, a good idea of the final number of clones to be isolated and
identified can be achieved.

An alternative (or even complementary) approach is to use the final differential
display reaction to screen a cDNA library to isolate full length clones for further 
characterization, or a DNA array (see later) to quickly identify known genes. SSH 
has been used in this laboratory to begin characterization of the short-term gene 
expression profiles of enzyme-inducers such as phenobarbital (Rockett et al. 1997) 
and Wy-14,643 (Rockett et al. unpublished observations). The isolation of 
differentially expressed genes in this manner enables the construction of a fingerprint 



Figure 5. Suppression PCR subtractive hybridization (SSH). Labels recoverable from the diagram:
tester cDNA with adaptor 1; driver cDNA (in excess); tester cDNA with adaptor 2; mix samples,
add fresh denatured driver, anneal; molecule types a, b, c, d and e; add primers and amplify by
PCR. Types a and d: no amplification. Type b: no amplification (suppressed due to formation of a
panhandle structure). Type c: linear amplification. Type e: exponential amplification. The legible
parts of the caption state that an excess of driver cDNA is added to each tester portion, that after
the first hybridization the two primary hybridization samples are mixed together without
denaturing, that fresh denatured driver can also be added at this point to allow further enrichment
of differentially expressed sequences, and that the procedure follows Diatchenko et al. (1996)
and Gurskaya et al. (1996).



Figure 6. Flow diagram showing the method used in this laboratory to isolate and identify clones of
genes which are differentially expressed in rat liver following short term exposure to the enzyme
inducers, phenobarbital and Wy-14,643. Steps shown in the diagram: control and treated animals;
extract mRNA from tissue of interest (e.g. liver); DNase treatment; convert to cDNA;
hybridization, subtraction and amplification (control driving tester for up-regulated genes, tester
driving control for down-regulated genes); complex probe for screening clones; run out products
on agarose gel; extract individual bands and clone in T/A vector; PCR of 5-10 clone cultures per
extracted band; screen using standard and HA agarose; different clones blotted and screened with
up- and down-regulated genes; differentially expressed clones selected; plasmid mini-preps of
selected clones; sequencing and identification.

of expressed genes which are unique to each compound and time/dose point. Such 
information could be useful in short-term characterization of the toxic potential of 
new compounds by comparing the gene expression profiles they elicit with those
produced by known inducers. Figure 6 shows a flow diagram of the method used to 
isolate, verify and clone differentially expressed genes, and figure 7 shows expression 
profiles obtained from a typical SSH experiment. Subsequent sub-cloning of the 
individual bands, sequencing and gene data base interrogation reveals many genes 
which are either up- or down-regulated by phenobarbital in the rat (tables 2 and 3). 

One of the advantages in using the SSH approach is that no prior knowledge is 
required of which specific genes are up/down-regulated subsequent to xenobiotic
Figure 7. Expression profiles from an SSH experiment on rat liver following exposure to
phenobarbital. mRNA extracted from control and treated livers was used to generate the
profiles using the PCR-Select cDNA subtraction kit (Clontech). Lanes: 1 kb ladder; genes
up-regulated following phenobarbital treatment; genes down-regulated following phenobarbital
treatment; 1 kb ladder. Reproduced from Rockett et al., with permission.

exposure, and an almost complete complement of genes is obtained. For example,
the peroxisome proliferator and non-genotoxic hepatocarcinogen Wy-14,643
up-regulates at least 28 genes and down-regulates at least 15 in the rat (a sensitive
species) and produces 48 up- and 37 down-regulated genes in the guinea pig (a
resistant species) (Rockett, Swales, Esdaile and Gibson, unpublished observations).
One of these genes, CD81, was up-regulated in the rat and down-regulated in the
guinea pig following Wy-14,643 treatment. CD81 (alternatively named TAPA-1) is
a widely expressed cell surface protein which is involved in a large number of cellular
processes including adhesion, activation, proliferation and differentiation (Levy et
al. 1998). Since all of these functions are altered to some extent in the phenomena
of hepatomegaly and non-genotoxic hepatocarcinogenesis, it is intriguing, and
probably mechanistically relevant, that CD81 expression is differentially regulated
in a resistant and susceptible species. However, the down-side of this approach is
that the majority of genes can be sequenced and matched to database sequences but
the latter are predominantly expressed sequence tags or genes of completely
unknown function, thus partially obscuring a realistic overall assessment of the
critical genes of genuine biological interest. Notwithstanding the lack of complete
functional identification of altered gene expression, such gene profiling studies
essentially provide a 'molecular fingerprint' in response to xenobiotic challenge,
thereby serving as a mechanistically relevant platform for further detailed
investigations.

Differential Display (DD) 

Originally described as 'RNA fingerprinting by arbitrarily primed PCR' (Liang
and Pardee 1992), this method is now more commonly referred to as 'differential
Table 2. Genes up-regulated in rat liver following 3-day exposure to phenobarbital.

Band number (approximate   Highest sequence    FASTA-EMBL gene identification
size in bp)                similarity

5 (1300)                   93.5%               CYP2B1
7 (1000)                                       Preproalbumin; serum albumin mRNA
8 (950)                    98.3%               NCI-CGAP-Pr1 H. sapiens (EST)
10 (850)                   95.7%               CYP2B1
11 (800)                   Clone 1  94.9%      CYP2B1
                           Clone 2  75.3%      CYP2B2
12 (750)                   93.8%               TRPM-2 mRNA; sulfated glycoprotein
15 (600)                   92.9%               Preproalbumin; serum albumin mRNA
16 (55)                    Clone 1  95.2%      CYP2B1
                           Clone 2  93.6%      Haptoglobulin mRNA partial alpha
21 (350)                   99.3%               18S, 5.8S & 28S rRNA

Bands 1-4, 6, 9, 13, 14 and 17-20 were shown to be false positives by dot blot analysis and, therefore,
were not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes do not
represent the complete spectrum of genes which are up-regulated in rat liver by phenobarbital, but
simply represent the genes sequenced and identified to date.



Table 3. Genes down-regulated in rat liver following 3-day exposure to phenobarbital.

Band number (approximate   Highest sequence     FASTA-EMBL gene identification
size in bp)                similarity

1 (1500)                   95.3%                3-oxoacyl-CoA thiolase
2 (1200)                   92.3%                Hemopexin mRNA
3 (1000)                   91.7%                Alpha-2u-globulin mRNA
7 (700)                    Clone 1  77.2%       M. musculus C1 inhibitor
                           Clone 2  94.5%       Electron transfer flavoprotein
                           Clone 3  91.0%       M. musculus Topoisomerase I (Topo I)
8 (650)                    Clone 1  86.9%       Soares 2NbMT M. musculus (EST)
                           Clone 2  96.2%       Alpha-2u-globulin (s-type) mRNA
9 (600)                    Clone 1  86.9%       Soares mouse NML M. musculus (EST)
                           Clone 2  82.0%       Soares p3NMF19.5 M. musculus (EST)
10 (550)                   73.8%                Soares mouse NML M. musculus (EST)
11 (525)                   95.7%                NCI-CGAP-Pr1 H. sapiens (EST)
12 (375)                   100.0%               Ribosomal protein
13 (23)                    Clone 1  97.2%       Soares mouse embryo NbME13.5 (EST)
                           Clone 2  100.0%      Fibrinogen B-beta-chain
                           Clone 3  100.0%      Apolipoprotein E gene
14 (170)                   96.0%                Soares p3NMF19.5 M. musculus (EST)
15 (140)                   97.3%                Stratagene mouse testis (EST)
Others: (300)              96.7%                R. norvegicus RASP 1 mRNA
        (275)              93.1%                Soares mouse mammary gland (EST)

EST = Expressed sequence tag. Bands 4-6 were shown to be false positives by dot blot analysis and,
therefore, were not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes
do not represent the complete spectrum of genes which are down-regulated in rat liver by phenobarbital,
but simply represent the genes sequenced and identified to date.



display' (DD). In this method, all the mRNA species in the control and treated cell
populations are amplified in separate reactions using reverse transcriptase-PCR
(RT-PCR). The products are then run side-by-side on sequencing gels. Those
bands which are present in one display only, or which are much more intense in one
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display compared to the other, are differentially expressed and may be recovered for 
further characterization. One advantage of this system is the speed with which it can 
be carried out: 2 days to obtain a display and as little as a week to make and identify
clones.

Two commonly used variations are based on different methods of priming the
reverse transcription step (figure 8). One is to use an oligo dT with a 2-base 'anchor'
at the 3'-end, e.g. 5'-(dT)nCA-3' (Liang and Pardee 1992). Alternatively, an
arbitrary primer may be used for 1st strand cDNA synthesis (Welsh et al. 1992).
This variant of RNA fingerprinting has also been called 'RAP' (RNA Arbitrarily
Primed)-PCR. One advantage of this second approach is that PCR products may be
derived from anywhere in the RNA, including open reading frames. In addition, it
can be used for mRNAs that are not polyadenylated, such as many bacterial mRNAs
(Wong and McClelland 1994). In both cases, following reverse transcription and
denaturation, second strand cDNA synthesis is carried out with an arbitrary primer
(arbitrary primers have a single base at each position, as compared to random
primers, which contain a mixture of all four bases at each position). The resulting
PCR, thus, produces a series of products which, depending on the system (primer
length and composition, polymerase and gel system), usually includes 50-100
products per primer set (Band and Sager 1989). When a combination of different
dT-anchors and arbitrary primers is used, almost all mRNA species from a cell can
be amplified. When the cDNA products from two different populations are analysed
side by side on a polyacrylamide gel, differences in expression can be identified and
the appropriate bands recovered for cloning and further analysis.
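
The role of the 2-base anchor can be made concrete with a short sketch. It enumerates the two-base-anchored oligo(dT) primers and works out which anchor would prime a given mRNA. The helper names are invented for illustration, and the default of 11 T residues is an assumption (the text gives only (dT)n); the rule that the first anchor base is not T, giving 12 possible anchors, is the usual convention for such primers and is stated here as background rather than taken from the text.

```python
from itertools import product

# Watson-Crick complements, giving the DNA base that pairs with each RNA/DNA base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}


def anchored_primers(n_t=11):
    """All two-base-anchored oligo(dT) primers, written 5'->3'.

    The first anchor base is G, C or A (not T), so that the primer is held at
    the junction between the poly(A) tail and the rest of the mRNA.
    """
    return ["T" * n_t + m + n for m, n in product("ACG", "ACGT")]


def anchor_for(mrna_last_two_bases):
    """Anchor (5'->3') that pairs with the two mRNA bases just 5' of the poly(A) tail.

    The primer anneals antiparallel to the mRNA, so the anchor is the reverse
    complement of those two bases.
    """
    return "".join(COMPLEMENT[b] for b in reversed(mrna_last_two_bases.upper()))


primers = anchored_primers()
print(len(primers))
print(anchor_for("UG"))  # anchor needed for an mRNA ending ...UG-poly(A)
```

An mRNA whose sequence ends ...UG before the poly(A) tail is therefore primed by (dT)nCA, and each anchored primer samples roughly one twelfth of the polyadenylated mRNA population.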

Although DD is perhaps the most popular approach used today for identifying
differentially expressed genes, it does suffer from several perceived disadvantages:

(1) It may have a strong bias towards high copy number mRNAs (Bertioli et al.
1995), although this has been disputed (Wan et al. 1996) and the isolation of very
low abundance genes may be achieved in certain circumstances (Guimeraes et
al. 1995a).

(2) The cDNAs obtained often only represent the extreme 3' end of the mRNA
(often the 3'-untranslated region), although this may not always be the case
(Guimeraes et al. 1995a). Since the 3' end is often not included in Genbank and
shows variation between organisms, cDNAs identified by DD cannot always be
matched with their genes, even if they have been identified.

(3) The pattern of differential expression seen on the display often cannot be
reproduced on Northern blots, with false positives arising in up to 70% of cases
(Sun et al. 1994). Some adaptations have been shown to reduce false positives,
including the use of two reverse transcriptases (Sung and Denman 1997),
comparison of uninduced and induced cells over a time course (Burn et al. 1994)
and comparison of DD-PCR products from two uninduced and two induced
lines (Sompayrac et al. 1995). The latter authors also reported that the use of
cytoplasmic RNA rather than total RNA reduces false positives arising from
nuclear RNA that is not transported to the cytoplasm.

Further details of the background, strengths and weaknesses of the DD
technique can be obtained from a review by McClelland et al. (1996) and from
articles by Liang et al. (1995) and Wan et al. (1996).
Figure 8 diagram labels: 1st strand cDNA primed with (dT)nCA; 1st strand cDNA primed with an
arbitrary primer; denature and synthesise 2nd strand with any arbitrary primer; 2nd strand cDNA;
cDNA can now be amplified by PCR using original primer pair.

Figure 8. Two approaches to differential display (DD) analysis. 1st strand synthesis can be carried out
either with an anchored polydT primer (where the anchor base N = G, C or A) or with an arbitrary
primer. The use of different combinations of G, C and A to anchor the first strand polydT primer
enables the priming of the majority of polyadenylated mRNAs. Arbitrary primers may hybridize at
none, one or more places along the length of the mRNA, allowing 1st strand cDNA synthesis to occur
at none, one or more points in the same gene. In both cases, 2nd strand synthesis is carried out with
an arbitrary primer. Since these arbitrary primers for the 2nd strand may also hybridize to the 1st
strand cDNA in a number of different places, several different 2nd strand products may be obtained
from one binding point of the 1st strand primer. Following 2nd strand synthesis, the original set of
primers is used to amplify the second strand products, with the result that numerous gene sequences
are amplified.

Restriction endonuclease-facilitated analysis of gene expression 
Serial Analysis of Gene Expression (SAGE)

A more recent development in the field of differential display is SAGE analysis
(Velculescu et al. 1995). This method uses a different approach to those discussed so
far and is based on two principles. Firstly, in more than 95% of cases, short
nucleotide sequences ('tags') of only nine or 10 base pairs provide sufficient
information to identify their gene of origin. Secondly, concatenation (linking
together in a series) of these tags allows sequencing of multiple cDNAs within a
single clone. Figure 9 shows a schematic representation of the SAGE process. In this
procedure, double stranded cDNA from the test cells is synthesized with a
biotinylated polydT primer. Following digestion with a commonly cutting (4 bp
recognition sequence) restriction enzyme ('anchoring enzyme'), the 3' ends of the
cDNA population are captured with streptavidin beads. The captured population is
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split into two and different adaptors ligated to the 5' ends of each group. Incorporated 
into the adaptors is a recognition sequence for a type IIS restriction enzyme— one 
which- cuts DNA at a defined distance (< 20 bp) from its recognition sequence. 
Hence, following digestion of each captured cDNA population with the IIS enzvme. 
the adaptors plus a short piece of the captured cDNA are released. The two 
populations are then ligated and the products amplified. The amplified products are 
cleaved with the original anchoring enzyme, religated (concatomers are formed in 
the process) and cloned. The advantage of this system is that hundreds of gene tags 
can be identified by sequencing only a few clones. Furthermore, the number f times 
a given transcript is identified is a quantitative measurement of that gene's 
abundance in the original population, a feature which facilitates identification of 
differentially expressed genes in different cell populations. 
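
The counting step described above can be sketched in a few lines of Python. This is
only an illustrative sketch under stated assumptions, not the published SAGE software:
the anchoring site CATG is taken from Figure 9, the tag length is fixed at 9 bp, every
ditag is assumed to be exactly two tags long, and no tag is assumed to contain the
anchoring site itself. The function names are hypothetical.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # anchoring enzyme site as drawn in Figure 9 (assumption)
TAG_LENGTH = 9        # the text cites tags of nine or 10 bp; 9 is assumed here


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def extract_tags(concatemer):
    """Split a concatemer at anchoring sites and return both tags of each ditag."""
    tags = []
    for ditag in concatemer.split(ANCHOR_SITE):
        if len(ditag) < 2 * TAG_LENGTH:
            continue  # flanking sequence or an incomplete ditag
        tags.append(ditag[:TAG_LENGTH])
        # The second tag of a ditag is read from the opposite strand.
        tags.append(reverse_complement(ditag[-TAG_LENGTH:]))
    return tags


def tag_counts(concatemers):
    """Count how often each tag occurs across all sequenced clones."""
    counts = Counter()
    for clone in concatemers:
        counts.update(extract_tags(clone))
    return counts
```

Comparing the resulting counts for two cell populations gives the relative abundance
of each tag, which is the quantitative feature referred to in the text.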

The disadvantages of SAGE analysis include the technical difficulty of the
method, the large amount of accurate sequencing required and a bias towards abundant
mRNAs. In addition, the method has not been validated in the pharmaco/toxicogenomic
setting and has only been used to examine well known tissue differences to date.

Gene Expression Fingerprinting (GEF) 

A different capture/restriction digest approach for isolating differentially 
expressed genes has been described by Ivanova and Belyavsky (1995). In this 
method, RNA is converted to cDNA using biotinylated oligo(dT) primers. The 
cDNA population is then digested with a specific endonuclease and captured with 
magnetic streptavidin microbeads to facilitate removal of the unwanted 5' digestion 
products. The use of restricted 3'-ends alone serves to reduce the complexity of the
cDNA fragment pool and helps to ensure that each RNA species is represented by 
not more than one restriction product. An adaptor is ligated to facilitate subsequent 
amplification of the captured population. PCR is carried out with one adaptor-specific
and one biotinylated polydT primer. The reamplified population is
recaptured and the non-biotinylated strands removed by alkaline dissociation. The 
non-biotinylated strand is then resynthesized using a different adaptor-specific 
primer in the presence of a radiolabelled dNTP. The labelled immobilized 3' cDNA 
ends are next sequentially treated with a series of different restriction endonucleases 
and the products from each digestion analysed by PAGE. The result is a fingerprint 
composed of a number of ladders (equal to the number of sequential digests used).
By comparing test versus control fingerprints, it is possible to identify differentially 
expressed products which can then be isolated from the gel and cloned. The 
advantages of this procedure are that it is very robust and reproducible, and the 
authors estimate that 80-93% of cDNA molecules are involved in the final 
fingerprint. The disadvantage is that polyacrylamide gels can rarely resolve more 
than 300-400 bands, which compares poorly to the 1000 or more which are 
estimated to be produced in an average experiment. The use of 2-D gels such as
those described by Uitterlinden et al. (1989) and Hatada et al. (1991) may help to
overcome this problem. 

A similar method for displaying restriction endonuclease fragments was later
described by Prashar and Weissman (1996). However, instead of sequential
digestion of the immobilized 3'-terminal cDNA fragments, these authors simply
compared the profiles of the control and treated populations without further
manipulation.

[Schematic for Figure 9. Steps shown: first strand cDNA synthesis using biotinylated
polydT primers; cDNA cleaved with AE and captured with streptavidin beads; divide in
half and ligate linkers; cleave with tagging enzyme (TE) and produce blunt ends;
ligate and amplify; cleave with AE, isolate ditags, concatenate, clone and sequence.]

Figure 9. Serial analysis of gene expression (SAGE) analysis. cDNA is cleaved with an anchoring enzyme
(AE) and the 3' ends captured using streptavidin beads. The cDNA pool is divided in half and each
portion ligated to a different linker, each containing a type IIS restriction site (tagging enzyme,
TE). Restriction with the type IIS enzyme releases the linker plus a short length of cDNA
(XXXXX and OOOOO indicate nucleotides of different tags). The two pools of tags are then
ligated and amplified using linker-specific primers. Following PCR, the products are cleaved with
the AE and the ditags isolated from the linkers using PAGE. The ditags are then ligated (during
which process concatenation occurs) and cloned into a vector of choice for sequencing. After
Velculescu et al. (1995), with permission.

DNA arrays

'Open' differential display systems are cumbersome in that it takes a great deal
of time to extract and identify candidate genes and then confirm that they are indeed
up- or down-regulated in the treated compared to the control tissue. Normally, the
latter process is carried out using Northern blotting or RT-PCR. Even so, each of
the aforementioned steps produces a bottleneck to the ultimate goal of rapid analysis
of gene expression. These problems will likely be addressed by the development of
so-called DNA arrays (e.g. Gress et al. 1992, Zhao et al. 1995, Schena et al. 1996),
the introduction of which has signalled the next era in differential gene expression
analysis. DNA arrays consist of a gridded membrane or glass 'chip' containing
hundreds or thousands of DNA spots, each consisting of multiple copies of part of
a known gene. The genes are often selected based on previously proven involvement
in oncogenesis, cell cycling, DNA repair, development and other cellular processes.
They are usually chosen to be as specific as possible for each gene and animal species.
Human and mouse arrays are already commercially available and a few companies
will construct a personalized array to order, for example Clontech Laboratories and
Research Genetics Inc. The technique is rapid in that hundreds or even thousands
of genes can be spotted on a single array, and that mRNA/cDNA from the test
populations can be labelled and used directly as probe. When analysed with
appropriate hardware and software, arrays offer a rapid and quantitative means to
assess differences in gene expression between two cell populations. Of course, there
can only be identification and quantitation of those genes which are in the array
(hence the term 'closed' system). Therefore, one approach to elucidating the
molecular mechanisms involved in a particular disease/development system may be
to combine an open and closed system: a DNA array to directly identify and
quantitate the expression of known genes in mRNA populations, and an open
system such as SSH to isolate unknown genes which are differentially expressed.
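
As a rough illustration of how array signals from two cell populations might be
compared, the sketch below normalises each array to its total signal and reports
genes whose ratio exceeds a chosen fold-change threshold. The normalisation scheme,
the 2-fold default threshold, the floor value and all function names are assumptions
made for illustration; they are not taken from any array vendor's software.

```python
def normalise(intensities):
    """Scale spot intensities so that each array sums to 1."""
    total = sum(intensities.values())
    return {gene: value / total for gene, value in intensities.items()}


def fold_changes(test, control, floor=1e-6):
    """Return test/control ratios for genes present on both arrays."""
    test_n, control_n = normalise(test), normalise(control)
    return {
        gene: max(test_n[gene], floor) / max(control_n[gene], floor)
        for gene in test_n
        if gene in control_n
    }


def differentially_expressed(test, control, threshold=2.0):
    """Keep genes that are up- or down-regulated by at least the threshold."""
    return {
        gene: ratio
        for gene, ratio in fold_changes(test, control).items()
        if ratio >= threshold or ratio <= 1.0 / threshold
    }
```

Only genes spotted on the array can appear in the result, which is the 'closed'
system limitation noted above.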

One of the main advantages of DNA arrays is the huge number of gene fragments
which can be put on a membrane: some companies have reported gridding up to
60000 spots on a single glass 'chip' (microscope slide). These high density chip-based
micro-arrays will probably become available as mass-produced off-the-shelf
items in the near future. This should facilitate the more rapid determination of
differential expression in time and dose-response experiments. Aside from their
high cost and the technical complexities involved in producing and probing DNA
arrays, the main problem which remains, especially with the newer micro-array
(gene-chip) technologies, is that results are often not wholly reproducible between
arrays. However, this problem is being addressed and should be resolved within the
next few years.



EST databases as a means to identify differentially expressed genes

Expressed sequence tags (ESTs) are partial sequences of clones obtained from
cDNA libraries. Even though most ESTs have no formal identity (putative
identification is the best to be hoped for), they have proven to be a rapid and efficient
means of discovering new genes and can be used to generate profiles of gene
expression in specific cells. Since they were first described by Adams et al. (1991),
there has been a huge explosion in EST production and it is estimated that there are
now well over a million such sequences in the public domain, representing over half
of all human genes (Hillier et al. 1996). This large number of freely available
sequences (both sequence information and clones are normally available royalty-free
from the originators) has enabled the development of a new approach towards
differential gene expression analysis as described by Vasmatzis et al. (1998). The
approach is simple in theory: EST databases are first searched for genes that have a
number of related EST sequences from the target tissue of choice, but none or few
from non-target tissue libraries. Programmes to assist in the assembly of such sets of
overlapping data may be developed in-house or obtained privately or from the
internet. For example, the Institute for Genomic Research (TIGR, found at
http://www.tigr.org) provides many software tools free of charge to the scientific
community. Included amongst these is the TIGR assembler (Sutton et al. 1995), a
tool for the assembly of large sets of overlapping data such as ESTs, bacterial
artificial chromosomes (BACs), or small genomes. Candidate EST clones representing
different genes are then analysed using RNA blot methods for size and tissue
specificity and, if required, used as probes to isolate and identify the full length
cDNA clone for further characterization. In practice however, the method is rather
more involved, requiring bioinformatic and computer analysis coupled with
confirmatory molecular studies. Vasmatzis et al. (1998) have described several
problems in this fledgling approach, such as separating highly homologous
sequences derived from different genes and an overemphasis of specificity for some
EST sequences. However, since these problems will largely be addressed by the
development of more suitable computer algorithms and an increased completeness
of the EST database, it is likely that this approach to identifying differentially
expressed genes may enjoy more patronage in the future.
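
The database search step ("a number of related ESTs from the target tissue, but none
or few from non-target libraries") can be sketched as a simple counting filter. This
is a minimal sketch, not the procedure of Vasmatzis et al. (1998): it assumes the
ESTs have already been assembled into gene clusters, and the record layout, the
thresholds and the function name are all hypothetical.

```python
from collections import defaultdict


def candidate_genes(est_records, target_tissue, min_target=3, max_other=0):
    """Return gene clusters enriched for ESTs from the target tissue.

    est_records is an iterable of (gene_cluster, tissue) pairs, one per EST.
    """
    target = defaultdict(int)
    other = defaultdict(int)
    for cluster, tissue in est_records:
        if tissue == target_tissue:
            target[cluster] += 1
        else:
            other[cluster] += 1
    return sorted(
        cluster
        for cluster, n in target.items()
        if n >= min_target and other[cluster] <= max_other
    )
```

Clusters passing the filter would still need the confirmatory molecular studies
described in the text, since EST counts depend heavily on library coverage.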



Problems and potential of differential expression techniques 

The holistic or single cell approach ? 

When working with in vivo models of differential expression, one of the first
issues to consider must be the presence of multiple cell types in any given specimen.
For example, a liver sample is likely to contain not only hepatocytes, but also
(potentially) Ito cells, bile ductule cells, endothelial cells, various immune cells (e.g.
lymphocytes, macrophages and Kupffer cells) and fibroblasts. Other tissues will
each have their own distinctive cell populations. Also, in the case of neoplastic tissue,
there are almost always normal, hyperplastic and/or dysplastic cells present in a
sample. One must, therefore, be aware that genes obtained from a differential
display experiment performed on an animal tissue model may not necessarily arise
exclusively from the intended 'target' cells, e.g. hepatocytes/neoplastic cells. If
appropriate, further analyses using immunohistochemistry, in situ hybridization or
in situ RT-PCR should be used to confirm which cell types are expressing the
gene(s) of interest. This problem is probably most acute for those studying the
differential expression of genes in the development of different cell types, where
there is a need to examine homologous cell populations. The problem is now being
addressed at the National Cancer Institute (Bethesda, MD, USA) where new
microdissection techniques have been employed to assist in their gene analysis programme,
the Cancer Genome Anatomy Project (CGAP) (for more information see web site:
http://www.ncbi.nlm.nih.gov/ncicgap/intro.html). There are also separation
techniques available that utilise cell-specific antigens as a means to isolate target cells,
e.g. fluorescence activated cell sorting (FACS) (Dunbar et al. 1998, Kas-Deelen et
al. 1998) and magnetic bead technology (Richard et al. 1998, Rogler et al. 1998).

However, those taking a holistic approach may consider this issue unimportant.
There is an equally appropriate view that all those genes showing altered expression
within a compromised tissue should be taken into consideration. After all, since all
tissues are complex mixes of different, interacting cell types which intimately
regulate each other's growth and development, it is clear that each cell type could in
some way contribute (positively or negatively) towards the molecular mechanisms
which lie behind responses to external stimuli or neoplastic growth. It is perhaps
then more informative to carry out differential display experiments using in vivo as
opposed to in vitro models, where uniform populations of identical cells probably
represent a partial, skewed or even inaccurate picture of the molecular changes that
occur.



The incidence and possible implications of inter-individual biological variation
should be considered in any approach where whole animal models are being used. It
is clear that individuals (humans and animals) respond in different ways to identical
stimuli. One of the best characterized examples is the debrisoquine oxidation
polymorphism, which is mediated by cytochrome CYP2D6 and determines the
pharmacokinetics of many commonly prescribed drugs (Lennard 1993, Meyer and
Zanger 1997). The reasons for such differences are varied and complex, but allelic
variations, regulatory region polymorphisms and even physical and mental health
can all contribute to observed differences in individual responses. Careful thought
should, therefore, be given to the specific objectives of the study and to the possible
value of pooling starting material (tissue/mRNA). The effect of this can be
beneficial through the ironing out of exaggerated responses and unimportant minor
fluctuations of (mechanistically) irrelevant genes in individual animals, thus
providing a clearer overall picture of the general molecular mechanisms of the
response. However, at the same time such minor variations may be of utmost
importance in deciding the ability of individual animals to succumb to or resist the
effects of a given chemical/disease.



How efficient are differential expression techniques at recovering a high percentage of
differentially expressed genes?

A number of groups have produced experimental data suggesting that mammalian
cells produce between 8000-15000 different mRNA species at any one time
(Mechler and Rabbitts 1981, Hedrick et al. 1984, Bravo 1990), although figures as
high as 20000-30000 have also been quoted (Axel et al. 1976). Hedrick et al. (1984)
provided evidence suggesting that the majority of these belong to the rare abundance
class. A breakdown of this abundance distribution is shown in table 1.

When the results of differential display experiments have been compared with
data obtained previously using other methods, it is apparent that not all differentially
expressed mRNAs are represented in the final display. In particular, rare messages
(which, importantly, often include regulatory proteins) are not easily recovered
using differential display systems. This is a major shortcoming, as the majority of
mRNA species exist at levels of less than 0.005% of the total population (table 1).
Bertioli et al. (1995) examined the efficiency of DD templates (heterogeneous
mRNA populations) for recovering rare messages and were unable to detect mRNA
species present at less than 1.2% of the total mRNA population, equivalent to an
intermediate or abundant species. Interestingly, when simple model systems (single
target only) were used instead of a heterogeneous mRNA population, the same
primers could detect levels of target mRNA down to 10000 x smaller. These results
are probably best explained by competition for substrates from the many PCR
products produced in a DD reaction.

The numbers of differentially expressed mRNAs reported in the literature using
various model systems provide further evidence that many differentially expressed
mRNAs are not recovered. For example, DeRisi et al. (1997) used DNA array
technology to examine gene expression in yeast following exhaustion of sugar in the
medium, and found that more than 1700 genes showed a change in expression of at
least 2-fold. In light of such a finding, it would not be unreasonable to suggest that
of the 8000-15000 different mRNA species produced by any given mammalian cell,
up to 1000 or more may show altered expression following chemical stimulation.
Whilst this may be an extreme figure, it is known that at least 100 genes are
activated/upregulated in Jurkat (T-) cells following IL-2 stimulation (Ullman et al.
1990). In addition, Wan et al. (1996) estimated that interferon-γ-stimulated HeLa
cells differentially express up to 433 genes (assuming 24000 distinct mRNAs
expressed by the cells). However, there have been few publications documenting
anywhere near the recovery of these numbers. For example, in using DD to compare
normal and regenerating mouse liver, Bauer et al. (1993) found only 70 of 38000
total bands to be different. Of these, 50% (35 genes) were shown to correspond to
differentially expressed bands. Chen et al. (1996) reported 10 genes upregulated in
female rat liver following ethinyl estradiol treatment. McKenzie and Drake (1997)
identified 14 different gene products whose expression was altered by phorbol
myristate acetate (PMA, a tumour promoter agent) stimulation of a human
myelomonocytic cell line. Kilty and Vickers (1997) identified 10 different gene
products whose expression was upregulated in the peripheral blood leukocytes of
allergic disease sufferers. Linskens et al. (1995) found 23 genes differentially
expressed between young and senescent fibroblasts. Techniques other than DD
have also provided an apparent paucity of differentially expressed genes. Using SH
for example, Cao et al. (1997) found 15 genes differentially expressed in colorectal
cancer compared to normal mucosal epithelium. Fitzpatrick et al. (1995) isolated 17
genes upregulated in rat liver following treatment with the peroxisome proliferator
clofibrate. Philips et al. (1990) isolated 12 cDNA clones which were upregulated in
highly metastatic mammary adenocarcinoma cell lines compared to poorly
metastatic ones. Prashar and Weissman (1996) used 3' restriction fragment analysis and
identified approximately 40 genes showing altered expression within 4 h of
activation of Jurkat T-cells. Groenink and Leegwater (1996) analysed 27 gene
fragments isolated using SSH of delayed early response phase of liver regeneration
and found only 12 to be upregulated.
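
The gap between expected and recovered numbers can be made concrete with some
back-of-envelope arithmetic. The sketch below is illustrative only: it assumes that
the fraction estimated by Wan et al. (1996) for HeLa cells (433 of 24000 mRNAs) can
be scaled to the 8000-15000 species quoted above, which is a simplification since the
studies listed used different tissues, stimuli and techniques. The reported counts
are those given in the preceding paragraph.

```python
def expected_changed(fraction_changed, total_species):
    """Scale an estimated fraction of changed genes to a given transcriptome size."""
    return fraction_changed * total_species


# Fraction of mRNAs estimated to change in interferon-stimulated HeLa cells.
fraction = 433 / 24000

low = expected_changed(fraction, 8000)    # about 144 genes
high = expected_changed(fraction, 15000)  # about 271 genes

# Numbers of differentially expressed genes reported in the studies cited above.
reported = {
    "Bauer et al. (1993)": 35,
    "Chen et al. (1996)": 10,
    "McKenzie and Drake (1997)": 14,
    "Kilty and Vickers (1997)": 10,
    "Linskens et al. (1995)": 23,
    "Cao et al. (1997)": 15,
    "Fitzpatrick et al. (1995)": 17,
    "Philips et al. (1990)": 12,
    "Prashar and Weissman (1996)": 40,
    "Groenink and Leegwater (1996)": 12,
}

# Recovery relative to the lower expectation, as a percentage.
recovery = {study: 100 * n / low for study, n in reported.items()}
```

Under these assumptions even the largest reported set falls well short of the lower
expectation, which is the point the paragraph makes qualitatively.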

In the laboratory, SSH was used to isolate up to 70 candidate genes which appear 
to show altered expression in guinea pig liver following short-term treatment with 
the peroxisome proliferator, WY-14,643 (Rockett, Swales, Esdaile and Gibson,
unpublished observations). However, these findings have still to be confirmed by 
analysis of the extracted tissue mRNA for differential expression of these sequences. 

Whilst the latest differential display technologies are purported to include design
and experimental modifications to overcome this lack of efficiency (in both the total
number of differentially expressed genes recovered and the percentage that are true
[...]

How sensitive are differential expression technologies?

There is little published data that addresses the issue of how large the
change in expression must be for it to permit isolation of the gene in question with
the various differential expression technologies. Although the isolation of genes
[...]

[...] experiments and animals. DD, on the other hand, is not subject to this grey
zone since, unlike SH approaches, it does not amplify the difference in expression
between two samples. Wan et al. (1996) reported that differences in expression of
twofold or more are detectable using DD.

Resolution and visualization of differential expression products 

It seems highly improbable with current technology that a gel system could be
developed that is able to resolve all gene species showing altered expression in any
given test system (be it SH- or DD-based). Polyacrylamide gel electrophoresis
(PAGE) gels can resolve size differences down to 0.2% (Sambrook et al. 1989) and are
used as standard in DD experiments. Even so, it is clear that a complex series of gene
products such as those seen in a DD will contain unresolvable components. Thus,
what appears to be one band in a gel may in fact turn out to be several. Indeed, it has
been well documented (Mathieu-Daude et al. 1996, Smith et al. 1997) that a single
band extracted from a DD often represents a composite of heterogeneous products,
and the same has been found for SSH displays in this laboratory (Rockett et al.
1997). One possible solution was offered by Mathieu-Daude et al. (1996), who
extracted and reamplified candidate bands from a DD display and used single strand
conformation polymorphism (SSCP) analysis to confirm which components
represented the truly differentially expressed product.

Many scientists often try to avoid the use of PAGE where possible because it is
technically more demanding than agarose gel electrophoresis (AGE). Unfortunately,
high resolution agarose gels such as Metaphor (FMC, Lichfield, UK) and AquaPor
HR (National Diagnostics, Hessle, UK), whilst easier to prepare and manipulate
than PAGE, can only separate DNA sequences which differ in size by around
1.5-2% (15-20 base pairs for a 1 kb fragment). Thus, SSH, RDA or other such
products which differ in size by less than this amount are normally not resolvable.
However, a simple technique does in fact exist for increasing the resolving power of
AGE: the inclusion of HA-red (10-phenyl neutral red-PEG ligand) or HA-yellow
(bisbenzamide-PEG ligand) (Hanse Analytik GmbH, Bremen, Germany) in a
gel separates identical or closely sized products on base content. Specifically,
HA-red and -yellow selectively bind to GC and AT DNA motifs, respectively
(Wawer et al. 1995, Hanse Analytik 1997, personal communication). Since both
HA-stains possess an overall positive charge, they migrate towards the cathode
when an electric field is applied. This is in direct opposition to DNA, which
is negatively charged and, therefore, migrates towards the anode. Thus, if two
DNA clones are identical in size (as perceived on a standard high resolution
agarose gel), but differ in AT/GC content, inclusion of a HA-dye in the gel
will effectively retard the migration of one of the sequences compared to the
other, making it apparently larger and, thus, providing a means of
differentiating between the two. The use of HA-red has been shown to resolve
sequences with an AT variation of less than 1% (Wawer et al. 1995), whilst Hanse
Analytik have reported that HA staining is so sensitive that in one case it was used
to distinguish two 567 bp sequences which differed by only a single point mutation
(Hanse Analytik 1996, personal communication). Therefore, if one wishes to check
whether all the clones produced from a specific band in a differential display
experiment are derived from the same gene species, a small amount of reamplified
or digested clone can be run on a standard high resolution gel, and a second aliquot
in a similar gel containing one of the HA-stains. The standard gel should indicate
any gross size differences, whilst the HA-stained gel should separate otherwise
unresolvable species (on standard AGE) according to their base content. Geisinger
et al. (1997) reported successful use of this approach for identifying DD-derived
clones. Figure 10 shows such an experiment carried out in this laboratory on clones
obtained from a band extracted from an SSH display.

Figure 10. Discrimination of clones of identical/nearly identical size using HA-red. Bands of decreasing
size [...] subtractive hybridization [...] were picked at random from each cloned band and their
[...] using PCR. The products were run on two gels: (A) a high resolution 2% agarose
gel and (B) a high resolution 2% agarose gel containing [...] HA-red. With few exceptions, all
[...] appear to be the same size (gel A). However, the presence of HA-red
(gel B), which separates identically-sized DNA fragments based on the percentage of GC within
the sequence, clearly indicates the presence of different gene species within each band. For
example, even though all five re-amplified clones of band 1 appear to be the same size, at least four
different gene species are represented.
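
The resolution figures quoted above translate directly into base pairs, and the
property exploited by the HA-stains is simply base composition. The sketch below
works through both. It is illustrative only: the constants restate the percentages
given in the text (using the lower bound of the 1.5-2% range for agarose), and the
function names are hypothetical. It does not model dye binding or migration.

```python
PAGE_RESOLUTION = 0.002     # 0.2% size difference (Sambrook et al. 1989, cited above)
AGAROSE_RESOLUTION = 0.015  # lower bound of the 1.5-2% quoted for high resolution agarose


def min_resolvable_difference(length_bp, resolution):
    """Smallest size difference (bp) resolvable for a fragment of this length."""
    return length_bp * resolution


def separable_by_size(seq_a, seq_b, resolution):
    """True if two fragments differ in length by at least the resolvable amount."""
    longer = max(len(seq_a), len(seq_b))
    difference = abs(len(seq_a) - len(seq_b))
    return difference >= min_resolvable_difference(longer, resolution)


def gc_fraction(seq):
    """Fraction of G and C bases, the property on which HA-red separation depends."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)
```

For a 1 kb fragment this gives about 15 bp for agarose and about 2 bp for PAGE, in
line with the text. Two clones of equal length are not separable by size on either
gel, but a difference in `gc_fraction` suggests they may be separable on an
HA-stained gel.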

An alternative approach is to carry out a 2-D analysis of the differential display 
products. In this approach, size-based separation is first carried out in a standard 
agarose gel. The gel slice containing the display is then extracted and incorporated 
into an HA gel for resolution based on AT/GC content.

Of course, one should always consider the possibility of there being different
gene species which are the same size and have the same GC/AT content. However,
even these species are not unresolvable given some effort: again, one might use
SSCP, or perhaps a denaturing gradient gel electrophoresis (DGGE) or temperature
gradient gel electrophoresis (TGGE) approach to resolve the contents of a band
either directly on the extracted band (Suzuki et al. 1991) or on the reamplified
product.

The requirement of some differential display techniques to visualize large 
numbers of products (e.g. DD and GEF) can also present a problem in that, in terms 
of numbers, the resolution of PAGE rarely exceeds 300-400 bands. One approach to 
overcoming this might be to use 2-D gels such as those described by Uitterlinden et
al. (1989) and Hatada et al. (1991).

Extraction of differentially expressed bands from a gel can be complex since in
some cases (e.g. DD, GEF), the results are visualized by autoradiographic means
such that precise overlay of the developed film on the gel must occur if the correct
band is to be extracted for further analysis. Clearly, a misjudged extraction can
account for many man-hours lost. This problem, and that of the use of radioisotopes,
has been addressed by several groups. For example, Lohmann et al. (1995)
demonstrated that silver staining can be used directly to visualize DD bands in
horizontal PAGs. An et al. (1996) avoided the use of radioisotopes by transferring a
small amount (20-30%) of the DNA from their DD to a nylon membrane, and
visualizing the bands using chemiluminescent staining before going back to extract
the remaining DNA from the gel. Chen and Peck (1996) went one step further and
transferred the entire DD to a nylon membrane. The DNA bands were then
visualized using a digoxigenin (DIG) system (DIG was attached to the polydT
primers used in the differential display procedure). Differentially expressed bands
were cut from the membrane and the DNA eluted by washing with PCR buffer prior
to reamplification.

One of the advantages of using techniques such as SSH and RDA is that the final
display can be run on an agarose gel and the bands visualized with simple ethidium
bromide staining. Whilst this approach can provide acceptable results, overstaining
with SYBR Green I or SYBR Gold nucleic acid stains (FMC) effectively enhances 
the intensity and sharpness of the bands. This greatly aids in their precise extraction 
and often reveals some faint products that may otherwise be overlooked. Whilst 
differential displays stained with SYBR Green I are better visualized using short 
wavelength UV (254 nm) rather than medium wavelength (306 nm), the shorter 
wavelength is much more DNA damaging. In practice, it takes only a few seconds 
to damage DNA extracted under 254 nm irradiation, effectively preventing 
reamplification and cloning. The best approach is to overstain with SYBR Green I 
and extract bands under a medium wavelength UV transillumination. 



The possible use of 'microfingerprinting' to reduce complexity

Given the sheer number of gene products and the possible complexity of each
band, an alternative approach to rapid characterization may be to use an enhanced
analysis of a small section of a differential display: a 'sub-fingerprint' or
'micro-fingerprint'. In this case, one could concentrate on those bands which only appear
in a particular chosen size region. Reducing the fingerprint in this way has at least
two advantages. One is that it should be possible to use different gel types,
concentrations and run times tailored exactly to that region. Currently, one might
run products from 100-3000+ bp on the same gel, which leads to compromise in the
gel system being used and consequently to suboptimal resolution, both in terms of
size and numbers, and can lead to problems in the accurate excision of individual
bands. Secondly, it may be possible to enhance resolution by using a 2-D analysis
using a HA-stain, as described earlier. In summary, if a range of gene product sizes
is carefully chosen to include certain 'relevant' genes, the 2-D system standardized,
and appropriate gene analysis used, it may be possible to develop a method for the
early and rapid identification of compounds which have similar or widely different
cellular effects. If the prognosis for exposure to one or more other chemicals which
display a similar profile is already known, then one could perhaps predict similar
effects for any new compounds which show a similar micro-fingerprint.
[...] Some off-the-shelf DNA arrays (e.g. Clontech's
Atlas cDNA Expression Array series) already anticipated this to some degree by
[...]

Screening 

False positives 



False positives have been discussed at length amongst the
differential display community (Liang 1993, 1995, Nishio et al. 1994, Sun et al.
1994, Sompayrac et al. 1995). The reason for false positives varies with the
[...] been HPLC purified can lead to the production of false positives through
illegitimate [...] (O'Neill and Sinclair 1997), whilst in DD they can arise through
[...] transcription of rRNA. In SH, false positives appear
to be derived largely from abundant gene species, although some may arise from
cDNA/mRNA species which do not undergo hybridization for technical reasons.

[...] of putative differentially expressed clones can be carried out
using a simple dot blot approach, in which labelled first strand probes synthesized
[...] are hybridized to [...] of said clones (Hedrick et al. [...]).
Differentially expressed clones will hybridize to
tester probe, but not driver. The disadvantage of this approach is that rare species
[...] original mRNA and confirm the altered expression using a more quantitative
[...] be achieved using Northern blots, the sensitivity is
poor by today's high standards and one must rely on PCR methods for accurate and
sensitive determinations (see below).

[...] procedures produce final products which are
between 100 and 1000 bp in size. However, this may considerably reduce the size of
the sequence for analysis of the DNA databases. This in turn leads to a reduced
confidence in the result: several families of genes have members whose DNA
sequences are almost identical except in a few key stretches, e.g. the cytochrome
[...]

[...] upregulated in liver of [...] exposed to Wy-14,643 and was identified
by a FASTA search as being transferrin (data not shown). However, transferrin's
[...] 14,643 (Hertz et al. 1996), and this was confirmed with subsequent RT-PCR
analysis. This suggests that the gene sequence isolated may belong to a gene which
is closely related to transferrin, but is regulated by a different mechanism.
[...] problem associated with SH technology is redundancy. In most cases
the [...] cDNA population must first be simplified by restriction
digestion. This is important for at least two reasons:

(1) To reduce complexity: long cDNA fragments may form complex networks
which prevent the formation of appropriate hybrids, especially at the high
concentrations required for efficient hybridization.

(2) Cutting the cDNAs into small fragments provides better representation of
individual genes. This is because genes derived from related but distinct
[...]
hybridize and be eliminated during the subtraction procedure (Ko 1990).

[...] different fragments from the same cDNA may differ considerably
in terms of hybridization and amplification and, thus, may not efficiently do one
or both. Thus, some fragments from differentially
expressed cDNAs may be eliminated during subtractive hybridization pro-
cedures. However, other fragments may be enriched and isolated. As a
consequence of this, some genes will be cut one or more times, giving rise to two
or more fragments of different sizes. If those same genes are differentially
expressed, then two or more of the different size fragments may come [...]
on final differential display, increasing the observed
redundancy and increasing the number of redundant sequencing reactions.

[...] comparisons also throw up another important point: at what degree
of sequence similarity does one accept a result? Is 90% identity between a gene
derived from your model species and another acceptably close? Is 95% between
your sequence and one from the same species also acceptable? This problem is
particularly relevant when the forward and reverse sequence comparisons give
[...] completely different gene species! An
[...] genes [...] definite (95% and above similarity), and then
group those between 60 and 95% as being related or possible homologues.
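
The similarity cut-offs mentioned above (95% and above, and 60 to 95%) can be written down as a simple binning rule. The following Python sketch is illustrative only: the function name, the example hits and the label used for matches below 60% are assumptions and do not come from the text.

```python
def classify_match(percent_identity: float) -> str:
    """Bin a database hit by percent identity using the 95% and 60% cut-offs."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    if percent_identity >= 95.0:
        return "definite match"
    if percent_identity >= 60.0:
        return "related or possible homologue"
    # The text does not name this bin; the label is an assumption.
    return "no reliable match"


# Hypothetical clones and identities, for illustration only.
hits = {"clone_A": 98.2, "clone_B": 90.0, "clone_C": 41.5}
labels = {name: classify_match(pct) for name, pct in hits.items()}
```

Where forward and reverse reads of the same clone fall into different bins, the clone would still need manual inspection; the rule only sorts individual comparisons.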

Quantitative analysis 

At some point, one must give consideration to the quantitative analysis of the
candidate genes, either as a means of confirming that they are truly differentially 
expressed, or in order to establish just what the differences are. Northern blot
analysis is a popular approach as it is relatively easy and quick to perform. However,
the major drawback with Northern blots is that they are often not sensitive enough
to detect rare sequences. Since the majority of messages expressed in a cell are of low 

[...] this is a major problem. Consequently, RT-PCR may be the
method of choice for confirming differential [...]. Although the procedure is
[...] compared to Northern analysis, requiring synthesis of primers and
optimization of reaction conditions for each gene species, it is now possible to set up
high throughput PCR systems using multichannel pipettes, 96+-well plates and



684 



J. C. Rockett et al. 



appropriate thermal cycling technology. Whilst quantitative analysis is more
desirable, being more accurate and without reliance on an internal standard, the
money and time needed to develop a competitor molecule is often excessive,
especially when one might be examining tens or even hundreds of gene species. The
[...]
must first of all choose an internal standard that does not change in the test cells
compared to the controls. Numerous reference genes have been tried in the past for

[...] gamma (IFN-γ, Frye et al. 1989), β-actin (Heuval et al. [...]),
glyceraldehyde-3-phosphate dehydrogenase (GAPDH, Wong et al. 1994), di-
hydrofolate reductase (DHFR, Mohler and Butler 1991), β2-microglobulin (β2-
m, Murphy et al. 1990), hypoxanthine phosphoribosyl transferase (HPRT, Foss et
al. [...]) and a number of others (ClonTechniques 1997b). Ideally, an internal
standard should not change its level of expression in the cell regardless of cell age,
stage in the cell cycle or through the effects of external stimuli. However, it has been
shown on numerous occasions that the levels of most housekeeping genes currently
used by the research community do in fact change under certain conditions and in
different tissues (ClonTechniques 1997b). It is imperative, therefore, that pre-
liminary experiments be carried out on a panel of housekeeping genes to establish
their suitability for use in the model system.
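
One way such a preliminary screen of a housekeeping panel might be scored is sketched below in Python. This is an assumption made for illustration, not a method given in the text: each candidate is ranked by the coefficient of variation of its signal across the experimental conditions, and all intensity values are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean of the signals."""
    average = mean(values)
    if average == 0:
        raise ValueError("mean signal is zero; CV is undefined")
    return pstdev(values) / average


def rank_reference_genes(panel):
    """Return gene names ordered from most stable (lowest CV) to least stable."""
    return sorted(panel, key=lambda gene: coefficient_of_variation(panel[gene]))


# Hypothetical signal intensities: control, low dose, high dose.
panel = {
    "GAPDH": [100.0, 140.0, 60.0],
    "beta-actin": [200.0, 210.0, 190.0],
    "HPRT": [50.0, 75.0, 25.0],
}
ranking = rank_reference_genes(panel)
```

A low coefficient of variation only shows that a gene was stable under the conditions tested; it says nothing about other tissues or treatments.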

[...] of quantitative data must also be treated with caution. By
comparing the lists of genes identified by differential expression one can perhaps
[...] react in different ways to external stimuli.
For example, [...] appear sensitive to the non-genotoxic effects of a wide
range of peroxisome proliferators whilst Syrian hamsters and guinea pigs are largely
resistant (Orton et al. 1984, Rodricks and Turnbull 1987, Lake et al. 1989, [...],
Makowska et al. 1992). A simplified approach to resolving the reason(s) why [...]
[...] up- and down-regulated genes in order to identify those which are
expressed in only one species and, through background knowledge of the effects of
the said gene, might suggest a mechanism [...]
or protection. Of course, the situation is likely to be far more complex. Perhaps if
there were one key gene protecting guinea pig from non-genotoxic effects and it was
[...] 30 times by PPs, the same gene might only be up-regulated five times
[...] the importance of [...]

gene may be overlooked. Just to complicate matters, a large change in expression
does not necessarily mean a biologically important change. For example, what is the
true relevance of gene Y which shows a 50-fold increase after a particular treatment

[...] has often been shown to be up-regulated 40-40-
fold by a number of unrelated stimuli; in light of this the 50-fold increase would
[...] However, the literature may show that gene Z has never been
recorded as having more than doubled in expression, which makes your 5-fold
increase all the more exciting. Perhaps even more interesting is if that same 5-fold
[...] following treatment with related
chemicals.

Problems in using the differential display approach

Differential display technology originally held promise of an easily obtainable
fingerprint of those genes which are up- or down-regulated in test animals/cells in
a developmental process or following exposure to given stimuli. However, it has
become clear that the fingerprinting process, whilst still valid, is much too complex
to be represented by a single technique profile. This is because all differential display
techniques have common and/or unique technical problems which preclude the 
isolation and identification of all those genes which show changes in expression. 
Furthermore, there are important genetic changes related to disease development 
which differential expression analysis is simply not designed to address. An example 
of this is the presence of small deletions, insertions, or point mutations such as those 
seen in activated oncogenes, tumour suppressor genes and individual poly-
morphisms. Polymorphic variations, small though they usually are, are often
regarded as being of paramount importance in explaining why some patients
respond better than others to certain drug treatments (and, in logical extension, why
some people are less affected by potentially dangerous xenobiotics/carcinogens than
others). The identification of such point mutations and naturally occurring
polymorphisms requires the subsequent application of sequencing, SSCP, DGGE
or TGGE to the gene of interest. Furthermore, differential display is not designed
to address issues such as alternatively spliced gene species or whether an increased
abundance of mRNA is a result of increased transcription or increased mRNA 
stability. 



Conclusions 

Perhaps the main advantage of open system differential display techniques is that 
they are not limited by extant theories or researcher bias in revealing genes which are 
differentially expressed, since they are designed to amplify all genes which 
demonstrate altered expression. This means that they are useful for the isolation of 
previously unknown genes which may turn out to be useful biomarkers of a particular
state or condition. At least one open system (SAGE) is also quantitative, thus
eliminating the need to return to the original mRNA and carry out Northern/PCR
analysis to confirm the result. However, the rapid progress of genome mapping 
projects means that over the next 5-10 years or so, the balance of experimental use 
will switch from open to closed differential display systems, particularly DNA 
arrays. Arrays are easier and faster to prepare and use, provide quantitative data, are 
suitable for high throughput analysis and can be tailored to look at specific signalling 
pathways or families of genes. Identification of all the gene sequences in human and 
common laboratory animals combined with improved DNA array technology, 
means that it will soon no longer be necessary to try to isolate differentially expressed
genes using the technically more demanding open system approach. Thus, their
main advantage (that of identifying unknown genes) will be largely eradicated. It is
likely, therefore, that their sphere of application will be reduced to analysis of the 
less common laboratory species, since it will be some time yet before the genomes of 
such animals as zebrafish, electric eels, gerbils, crayfish and squid, for example, will 
be sequenced. 

Of course, in the end the question will always remain: What is the functional/ 
biological significance of the identified, differentially expressed genes? One 
persistent problem is understanding whether differentially expressed genes are a
cause or consequence of the altered state. Furthermore, many chemicals, such as
non-genotoxic carcinogens, are also mitogens and so genes associated with 
replication will also be upregulated but may have little or nothing to do with the 
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ABSTRACT The recent ability to sequence whole genomes 
allows ready access to all genetic material. The approaches 
outlined here allow automated analysis of sequence for the 
synthesis of optimal primers in an automated multiplex 
oligonucleotide synthesizer (AMOS). The efficiency is such 
that all ORFs for an organism can be amplified by PCR. The
resulting amplicons can be used directly in the construction of 
DNA arrays or can be cloned for a large variety of functional 
analyses. These tools allow a replacement of single-gene 
analysis with a highly efficient whole-genome analysis. 

The genome sequencing projects have generated and will 
continue to generate enormous amounts of sequence data. The 
genomes of Saccharomyces cerevisiae, Escherichia coli, Hae-
mophilus influenzae (1), Mycoplasma genitalium (2), and Meth-
anococcus jannaschii (3) have been completely sequenced.
Other model organisms have had substantial portions of their
genomes sequenced as well, including the nematode Caeno-
rhabditis elegans (4) and the small flowering plant Arabidopsis
thaliana (5). This massive and increasing amount of sequence 
information allows the development of novel experimental 
approaches to identify gene function. 

One standard use of genome sequence data is to attempt to 
identify the functions of predicted open reading frames 
(ORFs) within the genome by comparison to genes of known 
function. Such a comparative analysis of all ORFs to existing 
sequence data is fast, simple, and requires no experimentation 
and is therefore a reasonable first step. While finding sequence 
homologies/motifs is not a substitute for experimentation, 
noting the presence of sequence homology and/or sequence 
motifs can be a useful first step in finding interesting genes, in
designing experiments and, in some cases, predicting function. 
However, this type of analysis is frequently uninformative. For 
example, over one-half of new ORFs in S. cerevisiae have no 
known function (6). If this is the case in a well studied organism 
such as yeast, the problem will be even worse in organisms that 
are less well studied or less manipulable. A large, experimen-
tally determined gene function database would make homol-
ogy/motif searches much more useful.

Experimental analysis must be performed to thoroughly 
understand the biological function of a gene product. Scaling 
up from classical "cottage industry" one-gene-oriented ap- 
proaches to whole-genome analysis would be very expensive 
and laborious. It is clear that novel strategies are necessary to 
efficiently pursue the next phase of the genome projects— 
whole-genome experimental analysis to explore gene expres- 
sion, gene product function, and other genome functions. 
Model organisms, such as S. cerevisiae, will be extremely

The publication costs of this article were defrayed in part by page charge
payment. This article must therefore be hereby marked "advertisement" in
accordance with 18 U.S.C. §1734 solely to indicate this fact.

© 1997 by The National Academy of Sciences 0027-8424/97/948945-3$2.00/0
PNAS is available online at http://www.pnas.org.



important in the development of novel whole-genome analysis 
techniques and, subsequently, in improving our understanding 
of other more complex and less manipulable organisms.

The genome sequence can be systematically used as a tool 
to understand ORFs, gene product function, and other ge- 
nome regions. Toward this end, a directed strategy has been 
developed for exploiting sequence information as a means of 
providing information about biological function (Fig. 1). Ef- 
forts have been directed toward the amplification of each 
predicted ORF or any other region of the genome ranging 
from a few base pairs to several kilobase pairs. There are many 
uses for these amplicons— they can be cloned into standard 
vectors or specialized expression vectors, or can be cloned into 
other specialized vectors such as those used for two-hybrid 
analysis. The amplicons can also be used directly by, for
example, arraying onto glass for expression analysis, for DNA 
binding assays, or for any direct DNA assay (7). As a pilot 
study, synthetic primers were made on the 96-well automated 
multiplex oligonucleotide synthesizer (AMOS) instrument (8) 
(Fig. 2). These oligonucleotides were used to amplify each 
ORF on yeast chromosome V. The current version of this 
instrument can synthesize three plates of 96 oligonucleotides 
each (25 bases) in an 8-hr day. The amplification of the entire 
set of PCR products was then analyzed by gel electrophoresis 
(Fig. 3). Successful amplification of the proper length product 
on the first attempt was 95%. This project demonstrates that 
one can go directly from sequence information to biological 
analysis in a truly automated, totally directed manner. 
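
The synthesis rate quoted above (three 96-well plates per 8-hr day) allows a rough estimate of instrument time. The Python sketch below is a back-of-the-envelope calculation only: it assumes two primers (forward and reverse) per ORF and one instrument running every day, and it takes the figures of 240 ORFs for the chromosome V test and about 6,000 ORFs for the whole yeast genome from elsewhere in the article. The function and constant names are hypothetical.

```python
import math

PLATES_PER_DAY = 3      # from the text: three plates per 8-hr day
WELLS_PER_PLATE = 96    # one oligonucleotide per well
PRIMERS_PER_ORF = 2     # assumption: one forward and one reverse primer


def synthesis_days(n_orfs: int) -> int:
    """Whole days of instrument time needed to make every primer once."""
    primers_needed = n_orfs * PRIMERS_PER_ORF
    primers_per_day = PLATES_PER_DAY * WELLS_PER_PLATE
    return math.ceil(primers_needed / primers_per_day)


days_chromosome_v = synthesis_days(240)    # pilot set on chromosome V
days_whole_genome = synthesis_days(6000)   # approximate yeast ORF count
```

The estimate ignores resynthesis of failed oligonucleotides and any instrument downtime.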

These amplicons can be incorporated directly in arrays or 
the amplicons can be cloned. If the amplicons are to be cloned, 
novel sequences can be incorporated at the 5' end of the 
oligonucleotide to facilitate cloning. One potential problem 
with cloning PCR products is that the cloned amplicons may 
contain sequence alterations that diminish their utility. One 
option would be to resequence each individual amplicon. 
However, this is expensive, inefficient, and time consuming. A 
faster, more cost-effective, and more accurate approach is to 
apply comparative sequencing by denaturing HPLC (9). This 
method is capable of detecting a single base change in a 2-kb 
heteroduplex. Longer amplicons can be analyzed by use of 
appropriate restriction fragments. If any change is detected in 
a clone, an alternate clone of the same region can be analyzed. 
Modifying the system to allow high throughput analysis by 
denaturing HPLC is also relatively simple and straightforward. 

If amplicons are used directly on arrays without cloning, it 
is important to note that, even if single PCR product bands are 
observed on gels, the PCR products will be contaminated with 
various amounts of other sequences. This contamination has 
the potential to affect the results in, for example, expression 
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94555. 
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Fig. 1. Overview of systematic method for isolating individual 
genes. Sequence information is obtained automatically from sequence
databases. The data are input into primer selection software specifi- 
cally designed to target ORFs as designated by database annotations. 
The output file containing the primer information is directly read by 
a high-throughput oligonucleotide synthesizer, which makes the oli- 
gonucleotides in 96-well plates (AMOS, automated multiplex oligo- 
nucleotide synthesizer). The forward and reverse primers are synthe-
sized in the same location on separate plates to facilitate the down-
stream handling of primers. The amplicons are generated by PCR in
96-well plates as well. 

analysis. On the other hand, direct use of the amplicons is
much less labor intensive and greatly decreases the occurrence 
of mistakes in clone identification, a ubiquitous problem 
associated with large clone set archiving and retrieving. 

Any large-scale effort to capture each ORF within a genome 
must rely on automation if cost is to be minimized while 
efficiency is maximized. Toward that end, primers targeting 
ORFs were designed automatically using simple new scripts 
and existing primer selection software. These script-selected 
primer sequences were directly read by the high-throughput 
synthesizer and the forward and reverse primers were synthe- 
sized in separate plates in corresponding wells to facilitate 
automated pipetting and PCR amplifications. Each of the 
resulting PCR products, generated with minimum labor, con- 
tains a known, unique ORF. 
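
The plate layout described above, with forward and reverse primers in corresponding wells of separate plates, can be sketched as a simple index-to-well mapping. This Python example is a minimal illustration: the row-by-row fill order, the plate labels (F1, R1, ...) and the ORF names are assumptions, not details given in the article.

```python
import string

ROWS = string.ascii_uppercase[:8]   # rows A-H of a 96-well plate
COLUMNS = 12                        # columns 1-12


def well_for(index: int):
    """Map a zero-based ORF index to (plate number, well), filling row by row."""
    plate, offset = divmod(index, len(ROWS) * COLUMNS)
    row, column = divmod(offset, COLUMNS)
    return plate + 1, f"{ROWS[row]}{column + 1}"


def primer_layout(orf_names):
    """Place each ORF's forward and reverse primers in the same well position."""
    layout = {}
    for index, name in enumerate(orf_names):
        plate, well = well_for(index)
        layout[name] = {
            "forward": (f"F{plate}", well),   # hypothetical plate label
            "reverse": (f"R{plate}", well),   # same well, separate plate
        }
    return layout


# Hypothetical ORF names for illustration.
layout = primer_layout([f"ORF{i:03d}" for i in range(100)])
```

Because both primers of a pair share a well coordinate, a multichannel pipette or robot can combine whole plates without any per-well lookup.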

Large-scale genome analysis projects are dependent on 
newly emerging technologies to make the studies practical and 
economically feasible. For example, the cost of the primers, a 
significant issue in the past, has been reduced dramatically to 
make feasible this and other projects that require tens of
thousands of oligonucleotides. Other methods of high- 
throughput analysis are also vital to the success of functional 
analysis projects, such as microarraying and oligonucleotide 
chip methods (10-14). 

Changes in attitude are also required. One of the major costs 
of commercial oligonucleotides is extensive quality control 
such that virtually 100% of the supplied oligonucleotides are 
successfully synthesized and work for their intended purpose. 
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Fig. 2. Overall approach for using database of a genome to direct 
biological analysis. The synthesis of the 6,000 ORFs (orfs) for each
gene of S. cerevisiae can be used in many applications utilizing both
cloning and microarraying technology. 

Considerable cost reduction can be obtained by simply de- 
creasing the expected successful synthesis rate to 95-97%. One 
can then achieve faster and cheaper whole genome coverage by 
simply adding a single quality control at the end of the 
experiment and batching the failures for resynthesis. 
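
The effect of accepting a 95-97% synthesis success rate and batching failures can be estimated with a short calculation. The Python sketch below is illustrative only: it assumes each oligonucleotide fails independently at the same rate in every round, which the article does not state, and the function name and the figure of 12,000 oligonucleotides are hypothetical.

```python
def resynthesis_schedule(n_oligos: int, success_rate: float):
    """Expected number of oligonucleotides to synthesize in each round.

    Round 1 is the full set; each later round remakes the expected failures
    of the previous one. Rounds stop once fewer than one failure is expected.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success rate must be in the interval (0, 1]")
    schedule = []
    remaining = float(n_oligos)
    while remaining >= 1.0:
        schedule.append(remaining)
        remaining *= 1.0 - success_rate
    return schedule


# Hypothetical project size: 12,000 primers at a 95% success rate.
rounds = resynthesis_schedule(12000, 0.95)
total_syntheses = sum(rounds)
```

Under these assumptions the extra synthesis amounts to only a few percent of the original run, which is the trade-off the paragraph above describes.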

The directed nature of the amplicon approach is of clear 
advantage. The sequence of each ORF is analyzed automati- 
cally, and unique specific primers are made to target each 
ORF. Thus, there is relatively little time or labor involved— for 
example, no random cloning and subsequent screening is 
required because each product is known. In the test system, 
primers for 240 ORFs from chromosome V were systematically 
synthesized, beginning from the left arm and continuing 
through to the right arm. At no point was there any manual 
analysis of sequence information to generate the collection. In 
many ways, now that the sequence is known, there is no need 
for the researcher to examine it. 

These amplicons can be arrayed and expression analysis can 
be done on all arrayed ORFs with a single hybridization (10).
Those ORFs that display significant differential expression 
patterns under a given selection are easily identified without 
the laborious task of searching for and then sequencing a clone. 
Once scaled up, the procedure provides even greater returns 
on effort, because a single hybridization will ultimately provide 
a "snapshot" of the expression of all genes in the yeast genome. 
Thus, the limiting factor in whole genome analysis will not be 
the analysis process itself, but will instead be the ability of 
researchers to design and carry out experimental selections. 

Current expression and genetic analysis technologies are 
geared toward the analysis of single genes and are ill suited to 
analyze numerous genes under many conditions. Additional 
difficulties with current technologies include: the effort and 
expense required to analyze expression and make mutants, the 
potential duplication of effort if done by different laboratories, 
and the possibility of conflicting results obtained from differ- 
ent laboratories. In contrast, whole genome analysis not only 
is more efficient, it also provides data of much higher quality; 
all genes are assayed and compared in parallel under exactly 
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the same conditions. In addition, amplicons have many appli- 
cations beyond gene expression. For example, one recent
approach is to incorporate a unique DNA sequence tag, 
synthesized as part of each gene specific primer, during 
amplification. The tags or molecular bar codes, when reintro- 
duced into the organism as a gene deletion or as a gene clone, 
can be used much more efficiently than individual mutations 
or clones because pools of tagged mutants or transformants 
can be analyzed in parallel. This parallel analysis is possible 
because the tags are readily and quantitatively amplified even 
in complex mixtures of tags (13). 

These ORF genome arrays and oligonucleotide tagged 
libraries can be used for many applications. Any conventional
selection applied to a library that gives discrete or multiple 
products can use these technologies for a simple direct read- 
out. These include screens and selections for mutant comple- 
mentation, overexpression suppression (15, 16), second-site 
suppressors, synthetic lethality, drug target overexpression 
(17), two-hybrid screens (18), genome mismatch scanning (19), 
or recombination mapping. 

The genome projects have provided researchers with a vast 
amount of information. These data must be used efficiently 
and systematically to gain a truly comprehensive understand- 
ing of gene function and, more broadly, of the entire genome 
which can then be applied to other organisms. Such global 
approaches are essential if we are to gain an understanding of 
the living cell. This understanding should come from the 
viewpoint of the integration of complex regulatory networks, 
the individual roles and interactions of thousands of functional 
gene products, and the effect of environmental changes on 
both gene regulatory networks and the roles of all gene 
products. The time has come to switch from the analysis of a 
single gene to the analysis of the whole genome. 

Support was provided by National Institutes of Health Grants 
R37H60198 and P01H600205. 
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The availability of genome-scale DNA sequence information and reagents has radically altered life <ri*n« 
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INTRODUCTION 
Technological advancements combined with in- 
tensive DNA sequencing efforts have generated an 
enormous database of sequence information over the 
past decade. To date, more than 3 million sequences, 
totaling over 2.2 billion bases [1], are contained 
within the GenBank database, which includes the 
complete sequences of 19 different organisms [2]. The 
first complete sequence of a free-living organism, 
Haemophilus influenzae, was reported in 1995 [3] and 
was followed shortly thereafter by the first complete 
sequence of a eukaryote, Saccharomyces cerevisiae [4].
The development of dramatically improved sequenc- 
ing methodologies promises that complete elucida- 
tion of the Homo sapiens DNA sequence is not far 
behind [5]. 

To exploit more fully the wealth of new sequence 
information, it was necessary to develop novel meth- 
ods for the high-throughput or parallel monitoring 
of gene expression. Established methods such as 
northern blotting, RNAse protection assays, S1 nu-
clease analysis, plaque hybridization, and slot blots
do not provide sufficient throughput to effectively 
utilize the new genomics resources. Newer methods 
such as differential display [6], high-density filter 
hybridization [7,8], serial analysis of gene expression 
[9], and cDNA- and oligonucleotide-based microarray 
"chip" hybridization [10-12] are possible solutions 
to this bottleneck. It is our belief that the microarray 
approach, which allows the monitoring of expres- 
sion levels of thousands of genes simultaneously, is 
a tool of unprecedented power for use in toxicology 
studies. 



Almost without exception, gene expression is al- 
tered during toxicity, as either a direct or indirect 
result of toxicant exposure. The challenge facing 
toxicologists is to define, under a given set of ex- 
perimental conditions, the characteristic and spe- 
cific pattern of gene expression elicited by a given 
toxicant. Microarray technology offers an ideal plat- 
form for this type of analysis and could be the foun- 
dation for a fundamentally new approach to 
toxicology testing. 

MICROARRAY DEVELOPMENT AND APPLICATIONS 
cDNA Microarrays 

In the past several years, numerous systems were 
developed for the construction of large-scale DNA 
arrays. All of these platforms are based on cDNAs 
or oligonucleotides immobilized to a solid sup- 
port. In the cDNA approach, cDNA (or genomic) 
clones of interest are arrayed in a multi-well for- 
mat and amplified by polymerase chain reaction. 
The products of this amplification, which are usu- 
ally 500- to 2000-bp clones from the 3' regions of 
the genes of interest, are then spotted onto solid 
support by using high-speed robotics. By using 
this method, microarrays of up to 10 000 clones 
can be generated by spotting onto a glass substrate 
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[13,14]. Sample detection for microarrays on glass 
involves the use of probes labeled with fluores- 
cent or radioactive nucleotides. 

Fluorescent cDNA probes are generated from con- 
trol and test RNA samples in single-round reverse-tran- 
scription reactions in the presence of fluorescently 
tagged dUTP (e.g., Cy3-dUTP and Cy5-dUTP), which
produces control and test products labeled with dif-
ferent fluors. The cDNAs generated from these two
populations, collectively termed the "probe," are then
mixed and hybridized to the array under a glass cov-
erslip [10,11,15]. The fluorescent signal is detected
by using a custom-designed scanning confocal mi-
croscope equipped with a motorized stage and lasers
for fluor excitation [10,11,15]. The data are analyzed
with custom digital image analysis software that de- 
termines for each DNA feature the ratio of fluor 1 to 
fluor 2, corrected for local background [16,17]. The 
strength of this approach lies in the ability to label 
RNAs from control and treated samples with differ- 
ent fluorescent nucleotides, allowing for the simul- 
taneous hybridization and detection of both 
populations on one microarray. This method elimi- 
nates the need to control for hybridization between 
arrays. The research groups of Drs. Patrick Brown and 
Ron Davis at Stanford University spearheaded the 
effort to develop this approach, which has been suc- 
cessfully applied to studies of Arabidopsis thaliana 
RNA [10], yeast genomic DNA [15], tumorigenic ver- 
sus non-tumorigenic human tumor cell lines [11], 
human T-cells [18], yeast RNA [19], and human in- 
flammatory disease-related genes [20]. The most dra- 
matic result of this effort was the first published 
account of gene expression of an entire genome, that 
of the yeast Saccharomyces cerevisiae [21].
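
The per-feature quantity described above, the ratio of fluor 1 to fluor 2 corrected for local background, can be illustrated with a few lines of Python. This is a minimal sketch, not the custom image analysis software cited in the text: the function name, the intensity values, the floor applied to very weak signals and the log2 transform are all assumptions.

```python
import math


def corrected_ratio(fluor1, background1, fluor2, background2, floor=1.0):
    """Ratio of fluor 1 to fluor 2 after subtracting each local background.

    `floor` is an assumed safeguard: it keeps a background-subtracted signal
    from reaching zero or going negative, which would break the ratio.
    """
    signal1 = max(fluor1 - background1, floor)
    signal2 = max(fluor2 - background2, floor)
    return signal1 / signal2


# Hypothetical intensities for one DNA feature.
ratio = corrected_ratio(1200.0, 200.0, 700.0, 200.0)   # (1000) / (500)
log2_ratio = math.log2(ratio)                          # symmetric scale for up and down
```

Because both samples are hybridized to the same feature, the ratio does not depend on how much DNA was spotted there, which is the advantage of two-colour detection noted above.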

In an alternative approach, large numbers of cDNA 
clones can be spotted onto a membrane support, al- 
beit at a lower density [7,22]. This method is useful 
for expression profiling and large-scale screening and 
mapping of genomic or cDNA clones [7,22-24]. In 
expression profiling on filter membranes, two dif- 
ferent membranes are used simultaneously for con- 
trol and test RNA hybridizations, or a single 
membrane is stripped and reprobed. The signal is 
detected by using radioactive nucleotides and visu- 
alized by phosphorimager analysis or autoradiogra- 
phy. Numerous companies now sell such cDNA 
membranes and software to analyze the image data 
[25-27].

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either 
by spotting prefabricated oligos on a glass support 
[13] or by the more elegant method of direct in situ 
oligo synthesis on the glass surface by photolithog- 
raphy [28-30]. The strength of this approach lies in 
its ability to discriminate DNA molecules based on 
single base-pair difference. This allows the applica- 
tion of this method to the fields of medical diagnos- 



tics, pharmacogenetics, and sequencing by hybrid- 
ization as well as gene-expression analysis. 

Fabrication of oligonucleotide chips by photoli- 
thography is theoretically simple but technically 
complex [29,30]. The light from a high-intensity 
mercury lamp is directed through a photolitho- 
graphic mask onto the silica surface, resulting in 
deprotection of the terminal nucleotides in the illu- 
minated regions. The entire chip is then reacted with 
the desired free nucleotide, resulting in selected chain 
elongation. This process requires only 4n cycles 
(where n = oligonucleotide length in bases) to syn- 
thesize a vast number of unique oligos, the total num- 
ber of which is limited only by the complexity of the 
photolithographic mask and the chip size [29,31,32]. 
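
The 4n relationship stated above can be made concrete with a short calculation. The Python sketch below simply contrasts the number of synthesis cycles (4n, from the text) with the number of distinct sequences of length n that exist in principle (4 to the power n, a standard combinatorial fact); the function names are hypothetical.

```python
def synthesis_cycles(oligo_length: int) -> int:
    """Cycles needed: one per nucleotide (A, C, G, T) at each position."""
    return 4 * oligo_length


def possible_sequences(oligo_length: int) -> int:
    """Number of distinct oligonucleotide sequences of the given length."""
    return 4 ** oligo_length


cycles_8mer = synthesis_cycles(8)        # 32 cycles
variants_8mer = possible_sequences(8)    # 65536 possible 8-mers
cycles_25mer = synthesis_cycles(25)      # 100 cycles
```

The cycle count grows linearly with length while the sequence space grows exponentially, so the practical limit is set by the mask and the chip area rather than by the chemistry.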

Sample preparation involves the generation of 
double-stranded cDNA from cellular poly(A)+ RNA 
followed by antisense RNA synthesis in an in vitro 
transcription reaction with biotinylated or fluor- 
tagged nucleotides. The RNA probe is then frag- 
mented to facilitate hybridization. If the indirect 
visualization method is used, the chips are incubated 
with fluor-linked streptavidin (e.g., phycoerythrin) 
after hybridization [12,33]. The signal is detected with 
a custom confocal scanner [34]. This method has
been applied successfully to the mapping of genomic 
library clones [35], to de novo sequencing by hybrid- 
ization [28,36], and to evolutionary sequence com- 
parison of the BRCA1 gene [37]. In addition, 
mutations in the cystic fibrosis [38] and BRCA1 [39] 
gene products and polymorphisms in the human im-
munodeficiency virus-1 clade B protease gene [40]
have been detected by this method. Oligonucleotide 
chips are also useful for expression monitoring [33] 
as has been demonstrated by the simultaneous evalu- 
ation of gene-expression patterns in nearly all open 
reading frames of the yeast strain S. cerevisiae [12]. 
More recently, oligonucleotide chips have been used 
to help identify single nucleotide polymorphisms in 
the human [41] and yeast [42] genomes. 

THE USE OF MICROARRAYS IN TOXICOLOGY 
Screening for Mechanism of Action 

The field of toxicology uses numerous in vivo 
model systems, including the rat, mouse, and rab- 
bit, to assess potential toxicity and these bioassays 
are the mainstay of toxicology testing. However, in 
the past several decades, a plethora of in vitro tech- 
niques have been developed to measure toxicity, 
many of which measure toxicant-induced DNA dam- 
age. Examples of these assays include the Ames test, 
the Syrian hamster embryo cell transformation as- 
say, micronucleus assays, measurements of sister 
chromatid exchange and unscheduled DNA synthe- 
sis, and many others. Fundamental to all of these 
methods is the fact that toxicity is often preceded 
by, and results in, alterations in gene expression. In 
many cases, these changes in gene expression are a
far more sensitive, characteristic, and measurable
endpoint than the toxicity itself. We therefore pro- 
pose that a method based on measurements of the 
genome-wide gene expression pattern of an organ- 
ism after toxicant exposure is fundamentally infor- 
mative and complements the established methods 
described above. 

We are developing a method by which toxicants 
can be identified and their putative mechanisms of 
action determined by using toxicant-induced gene ex- 
pression profiles. In this method, in one or more de- 
fined model systems, dose and time-course parameters 
are established for a series of toxicants within a given 
prototypic class (e.g., polycyclic aromatic hydrocar- 
bons (PAHs)). Cells are then treated with these agents 
at a fixed toxicity level (as measured by cell survival), 
RNA is harvested, and toxicant-induced gene expres- 
sion changes are assessed by hybridization to a cDNA 
microarray chip (Figure 1). We have developed a custom
DNA chip, called ToxChip v1.0, specifically for
this purpose and will discuss it in more detail below. 
The changes in gene expression induced by the test 
agents in the model systems are analyzed, and the 
common set of changes unique to that class of toxi- 
cants, termed a toxicant signature, is determined. 

This signature is derived by ranking across all experiments
the gene-expression data based on relative
fold induction or suppression of genes in treated
samples versus untreated controls and selecting the 
most consistently different signals across the sample 
set. A different signature may be established for each 
prototypic toxicant class. Once the signatures are de- 
termined, gene-expression profiles induced by un- 
known agents in these same model systems can then 
be compared with the established signatures. A match 
assigns a putative mechanism of action to the test 
compound. Figure 2 illustrates this signature method 
for different types of oxidant stressors, PAHs, and 
peroxisome proliferators. In this example, the un- 
known compound in question had a gene-expres- 
sion profile similar to that of the oxidant stressors in 
the database. We anticipate that this general method 
will also reveal cross talk between different pathways 
induced by a single agent (e.g., reveal that a com- 
pound has both PAH-like and oxidant-like proper- 
ties). In the future, it may be necessary to distinguish 
very subtle differences between compounds within 
a very large sample set (e.g., thousands of highly simi- 
lar structural isomers in a combinatorial chemistry 
library or peptide library). To generate these highly 
refined signatures, standard statistical clustering
techniques or principal-component analysis can be used.
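
The signature method described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation. Expression values are taken to be log2(treated/control) ratios. The ranking step is replaced by a plain cutoff: a gene enters a signature only if it changes by at least twofold in the same direction in every experiment of a class. The gene labels (A1, A2, B1), the values, and all function names are invented for the example.

```python
def derive_signature(class_experiments, min_abs_log2=1.0):
    """Return {gene: +1 or -1} for genes changed consistently across a class.

    class_experiments is a list of dicts mapping gene -> log2(treated/control).
    """
    if not class_experiments:
        return {}
    shared = set.intersection(*(set(e) for e in class_experiments))
    signature = {}
    for gene in shared:
        values = [e[gene] for e in class_experiments]
        if all(v >= min_abs_log2 for v in values):
            signature[gene] = 1       # consistently induced
        elif all(v <= -min_abs_log2 for v in values):
            signature[gene] = -1      # consistently suppressed
    return signature


def match_score(profile, signature, min_abs_log2=1.0):
    """Fraction of signature genes changed in the expected direction."""
    if not signature:
        return 0.0
    hits = sum(
        1 for gene, direction in signature.items()
        if direction * profile.get(gene, 0.0) >= min_abs_log2
    )
    return hits / len(signature)


def best_match(profile, signatures):
    """Name of the toxicant class whose signature the profile fits best."""
    return max(signatures, key=lambda name: match_score(profile, signatures[name]))


# Toy log2 ratios (invented): two experiments per prototypic class.
oxidant_runs = [{"A1": 2.1, "A2": -1.4, "B1": 0.1},
                {"A1": 1.8, "A2": -1.2, "B1": -0.2}]
pah_runs = [{"A1": 0.2, "A2": 0.1, "B1": 3.0},
            {"A1": -0.1, "A2": 0.3, "B1": 2.5}]
signatures = {"oxidant stressor": derive_signature(oxidant_runs),
              "PAH": derive_signature(pah_runs)}

unknown = {"A1": 1.9, "A2": -1.1, "B1": 0.0}
print(best_match(unknown, signatures))
```

A compound with partial scores against more than one signature would be a candidate for the cross talk between pathways mentioned above. For large, closely related compound sets, clustering or principal-component analysis would replace this simple scoring.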

For the studies outlined in Figure 2, we developed 
the custom cDNA microarray chip ToxChip v1.0.

[Figure 1 diagram: Control Population and Treated Population; RNA Isolation; Reverse Transcription with Cy3 and Cy5 labeling; Mix cDNAs and Apply to Array; DNA "Chip"; Hybridize Under Coverslip]

Figure 1. Simplified overview of the method for sample
preparation and hybridization to cDNA microarrays. For illustrative
purposes, samples derived from cell culture are depicted
although other sample types are amenable to this analysis.

Figure 2. Schematic representation of the method for identification
of a toxicant's mechanism of action. In this method,
gene-expression data derived from exposure of model systems
to known toxicants are analyzed, and a set of changes
characteristic to that type of toxicant (termed the toxicant
signature) is identified. As depicted, oxidant stressors produce
consistent changes in group A genes (indicated by red and
green circles), but not group B or C genes (indicated by gray
circles). The set of gene-expression changes elicited by the
suspected toxicant is then compared with these characteristic
patterns, and a putative mechanism of action is assigned to
the unknown agent.



The 2090 human genes that comprise this subarray 
were selected for their well-documented involve- 
ment in basic cellular processes as well as their re- 
sponses to different types of toxic insult. Included 
on this list are DNA replication and repair genes, 
apoptosis genes, and genes responsive to PAHs and 
dioxin-like compounds, peroxisome proliferators,
estrogenic compounds, and oxidant stress. Some of 
the other categories of genes include transcription 
factors, oncogenes, tumor suppressor genes, cyclins, 
kinases, phosphatases, cell adhesion and motility 
genes, and homeobox genes. Also included in this 
group are 84 housekeeping genes, whose hybridiza- 
tion intensity is averaged and used for signal nor- 
malization of the other genes on the chip. To date, 
very few toxicants have been shown to have appre- 
ciable effects on the expression of these housekeep- 
ing genes. However, this housekeeping list will be 
revised if new data warrant the addition or deletion 
of a particular gene. Table 1 contains a general de- 
scription of some of the different classes of genes 
that comprise ToxChip v1.0.
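
The housekeeping-based normalization described above can be sketched as follows. This is a minimal illustration, assuming that each spot's intensity is simply divided by the mean intensity of the housekeeping spots on the same chip. The gene labels, intensities, and function name are hypothetical.

```python
from statistics import mean


def normalize_to_housekeeping(intensities, housekeeping_genes):
    """Scale all intensities by the mean housekeeping-gene intensity.

    intensities maps gene -> raw hybridization intensity for one chip.
    """
    reference = [intensities[g] for g in housekeeping_genes if g in intensities]
    if not reference:
        raise ValueError("no housekeeping genes found on this chip")
    scale = mean(reference)
    if scale <= 0:
        raise ValueError("mean housekeeping intensity must be positive")
    return {gene: value / scale for gene, value in intensities.items()}


# Toy chip (invented values): two housekeeping spots and one gene of interest.
raw = {"hk1": 900.0, "hk2": 1100.0, "target": 2500.0}
normalized = normalize_to_housekeeping(raw, ["hk1", "hk2"])
print(normalized["target"])
```

Averaging over many housekeeping genes (84 on this chip) limits the influence of any single gene that a toxicant happens to affect.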

When a toxicant signature is determined, the 
genes within this signature are flagged within the 
database. When uncharacterized toxicants are then 
screened, the data can be quickly reformatted so that 
blocks of genes representing the different signatures
are displayed [11]. This facilitates rapid, visual interpretation
of data. We are also developing ToxChip
v2.0 and chips for other model systems,
including rat, mouse, Xenopus, and yeast, for use in 
toxicology studies. 

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the 
use of animals as model systems for toxicology test- 
ing. Unfortunately, these assays are inherently ex- 
pensive, require large numbers of animals and take a 
long time to complete and analyze. Therefore, the 
National Institute of Environmental Health Sciences 
(NIEHS), the National Toxicology Program, and the 
toxicology community at large are committed to re- 
ducing the number of animals used, by developing 
more efficient and alternative testing methodologies. 
Although substantial progress has been made in the 
development of alternative methods, bioassays are 
still used for testing endpoints such as neurotoxic- 
ity, immunotoxicity, reproductive and developmen- 
tal toxicology, and genetic toxicology. The rodent 
cancer bioassay is a particularly expensive and time- 
consuming assay, as it requires almost 4 yr, 1200 
animals, and millions of dollars to execute and ana- 
lyze [43]. In vitro experiments of the type outlined 
in Figure 2 might provide evidence that an unknown
agent is (or is not) responsible for eliciting a given
biological response. This information would help to
select a bioassay more specifically suited to the agent
in question or perhaps suggest that a bioassay is not
necessary, which would dramatically reduce cost,
animal use, and time.

Table 1. ToxChip v1.0: A Human cDNA Microarray
Chip Designed to Detect Responses to Toxic Insult

Gene category                             No. of genes on chip

Apoptosis                                  72
DNA replication and repair                 99
Oxidative stress/redox homeostasis         90
Peroxisome proliferator responsive         22
Dioxin/PAH responsive                      12
Estrogen responsive                        63
Housekeeping                               84
Oncogenes and tumor suppressor genes       76
Cell-cycle control                         51
Transcription factors                     131
Kinases                                   276
Phosphatases                               88
Heat-shock proteins                        23
Receptors                                 349
Cytochrome P450s                           30

*This list is intended as a general guide. The gene categories are not
unique, and some genes are listed in multiple categories.

The addition of microarray techniques to stan- 
dard bioassays may dramatically enhance the sen- 
sitivity and interpretability of the bioassay and 
possibly reduce its cost. Gene-expression signatures 
could be determined for various types of tissue-spe- 
cific toxicants, and new compounds could be 
screened for these characteristic signatures, provid- 
ing a rapid and sensitive in vivo test. Also, because 
gene expression is often exquisitely sensitive to low 
doses of a toxicant, the combination of gene-expres- 
sion screening and the bioassay might allow the use 
of lower toxicant doses, which are more relevant to 
human exposure levels, and the use of fewer ani- 
mals. In addition, gene-expression changes are nor- 
mally measured in hours or days, not in the months 
to years required for tumor development. Further- 
more, microarrays might be particularly useful for 
investigating the relationship between acute and 
chronic toxicity and identifying secondary effects 
of a given toxicant by studying the relationship 
between the duration of exposure to a toxicant and 
the gene-expression profile produced. Thus, a bio- 
assay that incorporates gene-expression signatures 
with traditional endpoints might be substantially 
shorter, use more realistic dose regimens, and cost 
substantially less than the current assays do. 

These considerations are also relevant for branches 
of toxicology not related to human health and not 
using rodents as model systems, such as aquatic toxi- 
cology and plant pathology. Bioassays based on the 
fathead minnow, Daphnia, and Arabidopsis could
also be improved by the addition of microarray analysis.
The combination of microarrays with traditional
bioassays might also be useful for investigating some 
of the more intractable problems in toxicology re- 
search, such as the effects of complex mixtures and 
the difficulties in cross-species extrapolation. 

Exposure Assessment, Environmental Monitoring 
and Drug Safety 

The currently used methods for assessment of ex- 
posure to chemical toxicants are based on measure- 
ment of tissue toxin levels or on surrogate markers 
of toxicity, termed biomarkers (e.g., peripheral blood 
levels of hepatic enzymes or DNA adducts). Because 
gene expression is a sensitive endpoint, gene expres- 
sion as measured with microarray technology may 
be useful as a new biomarker to more precisely iden- 
tify hazards and to assess exposure. Similarly, 
microarrays could be used in an environmental- 
monitoring capacity to measure the effect of poten- 
tial contaminants on the gene-expression profiles 
of resident organisms. In an analogous fashion, 
microarrays could be used to measure gene-expres- 
sion endpoints in subjects in clinical trials. The com- 
bination of these gene-expression data and more 
established toxic endpoints in these trials could be 
used to define highly precise surrogates of safety. 

Gene-expression profiles in samples from exposed 
individuals could be compared to the profiles of the 
same individuals before exposure. From this information,
the nature of the toxic exposure can be determined
or a relative clinical safety factor estimated.
In the future it may also be possible to estimate not 
only the nature but the dose of the toxicant for a 
given exposure, based on relative gene-expression 
levels. This general approach may be particularly 
appropriate for occupational-health applications, in 
which unexposed and exposed samples from the 
same individuals may be obtainable. For example, 
a pilot study of gene expression in peripheral-blood 
lymphocytes of Polish coke-oven workers exposed 
to PAHs (and many other compounds) is under con- 
sideration at the NIEHS. An important consideration 
for these types of studies is that gene expression can 
be affected by numerous factors, including diet, 
health, and personal habits. To reduce the effects 
of these confounding factors, it may be necessary 
to compare pools of control samples with pools of 
treated samples. In the future it may be possible to 
compare exposed sample sets to a national database 
of human-expression data, thus eliminating the 
need to provide an unexposed sample from the same 
individual. Efforts to develop such a national gene-expression
database are currently under way [44,45].
However, this national database approach will re- 
quire a better understanding of genome-wide gene 
expression across the highly diverse human popu- 
lation and of the effects of environmental factors 
on this expression.

Alleles, Oligo Arrays, and Toxicogenetics

Gene sequences vary between individuals, and 
this variability can be a causative factor in human 
diseases of environmental origin [46,47]. A new area 
of toxicology, termed toxicogenetics, was recently 
developed to study the relationship between genetic 
variability and toxicant susceptibility. This field is 
not the subject of this discussion, but it is worth- 
while to note that the ability of oligonucleotide ar- 
rays to discriminate DNA molecules based on single 
base-pair differences makes these arrays uniquely 
useful for this type of analysis. Recent reports dem- 
onstrated the feasibility of this approach [41,42]. 
The NIEHS has initiated the Environmental Genome 
Project to identify common sequence polymor- 
phisms in 200 genes thought to be involved in en- 
vironmental diseases [48]. In a pilot study on the 
feasibility of this application to the Environmental 
Genome Project, oligonucleotide arrays will be used 
to resequence 20 candidate genes. This toxicogenetic 
approach promises to dramatically improve our un- 
derstanding of interindividual variability in disease 
susceptibility. 

FUTURE PRIORITIES 
There are many issues that must be addressed be- 
fore the full potential of microarrays in toxicology 
research can be realized. Among these are model sys- 
tem selection, dose selection, and the temporal na- 
ture of gene expression. In other words, in which 
species, at what dose, and at what time do we look 
for toxicant-induced gene expression? If human 
samples are analyzed, how variable is global gene 
expression between individuals, before and after toxi- 
cant exposure? What are the effects of age, diet, and 
other factors on this expression? Experience, in the 
form of large data sets of toxicant exposures, will 
answer these questions. 

One of the most pressing issues for array scientists 
is the construction of a national public database 
(linked to the existing public databases) to serve as a 
repository for gene-expression data. This relational 
database must be made available for public use, and 
researchers must be encouraged to submit their ex- 
pression data so that others may view and query the 
information. Researchers at the National Institutes 
of Health have made laudable progress in develop- 
ing the first generation of such a database [44,45]. In 
addition, improved statistical methods for gene clus- 
tering and pattern recognition are needed to ana- 
lyze the data in such a public database. 

The proliferation of different platforms and meth- 
ods for microarray hybridizations will improve 
sample handling and data collection and analysis and 
reduce costs. However, the variety of microarray 
methods available will create problems of data com- 
patibility between platforms. In addition, the near- 
infinite variety of experimental conditions under 



which data will be collected by different laborato- 
ries will make large-scale data analysis extremely dif- 
ficult. To help circumvent these future problems, a 
set of standards to be included on all platforms 
should be established. These standards would facili- 
tate data entry into the national database and serve 
as reference points for cross-platform and inter-labo- 
ratory data analysis. 

Many issues remain to be resolved, but it is clear 
that new molecular techniques such as microarray 
hybridization will have a dramatic impact on toxicol- 
ogy research. In the future, the information gathered 
from microarray-based hybridization experiments will 
form the basis for an improved method to assess the 
impact of chemicals on human and environmental 
health.

Abstract

Recent progress in genomics and proteomics technologies has created a unique opportunity to significantly impact 
the pharmaceutical drug development processes. The perception that cells and whole organisms express specific 
inducible responses to stimuli such as drug treatment implies that unique expression patterns, molecular fingerprints
indicative of a drug's efficacy and potential toxicity, are accessible. The integration into state-of-the-art toxicology of
assays allowing one to profile treatment-related changes in gene expression patterns promises new insights into 
mechanisms of drug action and toxicity. The benefits will be improved lead selection, and optimized monitoring of 
drug efficacy and safety in pre-clinical and clinical studies based on biologically relevant tissue and surrogate markers.
© 2000 Elsevier Science Ireland Ltd. All rights reserved.

1. Introduction

The majority of drugs act by binding to protein 
targets, most to known proteins representing en- 
zymes, receptors and channels, resulting in effects 
such as enzyme inhibition and impairment of 
signal transduction. The treatment-induced per- 
turbations provoke feedback reactions aiming to 
compensate for the stimulus, which almost always 
are associated with signals to the nucleus, resulting
in altered gene expression. Such gene expression
regulations account for both the
pharmacological action and the toxicity of a drug
and can be visualized by either global mRNA or 
global protein expression profiling. Hence, for 
each individual drug, a characteristic gene regula- 
tion pattern, its molecular fingerprint, exists 
which bears valuable information on its mode of 
action and its mechanism of toxicity. 

Gene expression is a multistep process that 
results in an active protein (Fig. 1). There exist 
numerous regulation systems that exert control at 
and after the transcription and the translation 
step. Genomics, by definition, encompasses the 
quantitative analysis of transcripts at the mRNA 
level, while the aim of proteomics is to quantify 
gene expression further down-stream, creating a 
snapshot of gene regulation closer to ultimate cell 
function control.

2. Global mRNA profiling

Expression data at the mRNA level can be 
produced using a set of different technologies 
such as DNA microarrays, reverse transcript 
imaging, amplified fragment length polymorphism 
(AFLP), serial analysis of gene expression 
(SAGE) and others. Currently, DNA microarrays 
are very popular and promise a great potential. 
On a typical array, each gene of interest is represented
either by a long DNA fragment (200-2400
bp) typically generated by polymerase chain reaction
(PCR) and spotted on a suitable substrate
using robotics (Schena et al., 1995; Shalon et al.,
1996) or by several short oligonucleotides (20-30
bp) synthesized directly onto a solid support using
photolabile nucleotide chemistry (Fodor et al.,
1991; Chee et al., 1996). From control and treated
tissues, total RNA or mRNA is isolated and 
reverse transcribed in the presence of radioactive 
or fluorescent labeled nucleotides, and the labeled 
probes are then hybridized to the arrays. The 
intensity of the array signal is measured for each 
gene transcript by either autoradiography or laser 
scanning confocal microscopy. The ratio between 
the signals of control and treated samples reflects
the relative drug-induced change in transcript 
abundance. 
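
As an illustration of how such ratios might be read, the sketch below labels each gene by its treated-to-control signal ratio. It is a hypothetical example: the twofold cutoff is an arbitrary assumption, not a value from the text, and the gene labels and signal values are invented.

```python
def classify_changes(control, treated, fold_cutoff=2.0):
    """Label genes as induced, repressed, or unchanged by signal ratio.

    control and treated map gene -> array signal intensity.
    """
    if fold_cutoff <= 1.0:
        raise ValueError("fold_cutoff must be greater than 1")
    labels = {}
    for gene in control.keys() & treated.keys():
        c, t = control[gene], treated[gene]
        if c <= 0 or t <= 0:
            labels[gene] = "not measurable"   # ratio undefined or meaningless
            continue
        ratio = t / c
        if ratio >= fold_cutoff:
            labels[gene] = "induced"
        elif ratio <= 1.0 / fold_cutoff:
            labels[gene] = "repressed"
        else:
            labels[gene] = "unchanged"
    return labels


# Toy signals (invented values).
control_signal = {"g1": 100.0, "g2": 100.0, "g3": 100.0}
treated_signal = {"g1": 400.0, "g2": 40.0, "g3": 110.0}
print(classify_changes(control_signal, treated_signal))
```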



3. Global protein profiling 

Global quantitative expression analysis at the 
protein level is currently restricted to the use of 
two-dimensional gel electrophoresis. This tech- 
nique combines separation of tissue proteins by 
isoelectric focusing in the first dimension and by 
sodium dodecyl sulfate slab gel electrophoresis- 
based molecular weight separation on the second, 
orthogonal dimension (Anderson et al., 1991). 
The product is a rectangular pattern of protein 
spots that are typically revealed by Coomassie
Blue, silver or fluorescent staining (Fig. 2). 
Protein spots are identified by mass spectrometry 
following generation of peptide mass fingerprints 
(Mann et al., 1993) and sequence tags (Wilkins et
al., 1996). Similar to the mRNA approach, the
ratio between the optical densities of corresponding spots from
control and treated samples is used to
search for treatment-related changes.



4. Expression data analysis 

Bioinformatics forms a key element required to
organize, analyze and store expression data from
either source, the mRNA or the protein level. The
overall objective, once a mass of high-quality
quantitative expression data has been collected, is
to visualize complex patterns of gene expression
changes, to detect pathways and sets of genes
tightly correlated with treatment efficacy and toxicity,
and to compare the effects of different sets of
treatment (Anderson et al., 1996). As the drug
effect database is growing, one may detect similarities
and differences between the molecular fingerprints
produced by various drugs, information
that may be crucial to make a decision whether to
refocus or extend the therapeutic spectrum of a
drug candidate.

Fig. 1. Production of an active protein is a multistep process in which numerous regulation systems exert control at various stages
of expression. Molecular fingerprints of drugs can be visualized through expression profiling at the mRNA level (genomics) using
a variety of technologies and at the protein level (proteomics) using two-dimensional gel electrophoresis.

Fig. 2. Computerized representation of a Coomassie Blue stained two-dimensional gel electrophoresis pattern of Fischer F344 rat
liver homogenate.



5. Comparison of global mRNA and protein 
expression profiling 

There are several synergies and overlaps of data 
obtained by mRNA and protein expression analysis.
Low-abundance transcripts may not be easily
quantified at the protein level using standard two- 
dimensional gel electrophoresis analysis and their 
detection may require prefractionation of sam- 
ples. The expression of such genes may be prefer- 
ably quantified at the mRNA level using 
techniques allowing PCR-mediated target amplification.
Tissue biopsy samples typically yield good
quality of both mRNA and proteins; however, the 
quality of mRNA isolated from body fluids is 
often poor due to the faster degradation of 
mRNA when compared with proteins. RNA sam- 
ples from body fluids such as serum or urine are 
often not very 'meaningful', and secreted proteins 
are likely more reliable surrogate markers for 
treatment efficacy and safety. Detection of post-translational
modifications, events often related to
function or nonfunction of a protein, is restricted 
to protein expression analysis and rarely can be 
predicted by mRNA profiling. Information on 
subcellular localization and translocation of 
proteins has to be acquired at the level of the 
protein in combination with sample prefractiona- 
tion procedures. The growing evidence of a poor 
correlation between mRNA and protein abun- 
dance (Anderson and Seilhamer, 1997) further 
suggests that the two approaches, mRNA and 
protein profiling, are complementary and should 
be applied in parallel. 
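
One simple way to quantify the agreement between the two levels is a Pearson correlation across genes measured by both approaches. The sketch below is illustrative only: `pearson` is a hypothetical helper, and the paired values are invented toy numbers, not data from Anderson and Seilhamer (1997).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy per-gene abundance changes (invented): mRNA level vs. protein level.
mrna_change = [2.0, 0.5, -1.0, 1.5, 0.0]
protein_change = [0.3, 0.6, -0.2, 0.1, 0.9]
print(round(pearson(mrna_change, protein_change), 3))
```

A coefficient near zero for such paired measurements would support the point made above that neither level can stand in for the other.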

6. Expression profiling and drug development 

Understanding the mechanisms of action and 
toxicity, and being able to monitor treatment 
efficacy and safety during trials are crucial for the
successful development of a drug. Mechanistic 
insights are essential for the interpretation of drug 
effects and enhance the chances of recognizing 
potential species specificities contributing to an 
improved risk profile in humans (Richardson et 
al., 1993; Steiner et al., 1996b; Aicher et al., 1998).
The value of expression profiling further increases 
when links between treatment-induced expression 
profiles and specific pharmacological and toxic 
endpoints are established (Anderson et al., 1991, 
1995, 1996; Steiner et al., 1996a). Changes in gene
expression are known to precede the manifesta- 
tion of morphological alterations, giving expres- 
sion profiling a great potential for early 
compound screening, enabling one to select drug 
candidates with wide therapeutic windows 
reflected by molecular fingerprints indicative of 
high pharmacological potency and low toxicity 
(Arce et al., 1998). In later phases of drug development,
surrogate markers of treatment efficacy
and toxicity can be applied to optimize the moni- 
toring of preclinical and clinical studies (Doherty 
et al., 1998). 



7. Perspectives 

The basic methodology of safety evaluation has 
changed little during the past decades. Toxicity in 
laboratory animals has been evaluated primarily 
by using hematological, clinical chemistry and 
histological parameters as indicators of organ 
damage. The rapid progress in genomics and pro- 
teomics technologies creates a unique opportunity 
to dramatically improve the predictive power of 
safety assessment and to accelerate the drug devel- 
opment process. Application of gene and protein 
expression profiling promises to improve lead se- 
lection, resulting in the development of drug can- 
didates with higher efficacy and lower toxicity. 
The identification of biologically relevant surrogate
markers correlated with treatment efficacy
and safety bears a great potential to optimize the
monitoring of pre-clinical and clinical trials.

Decoding the genetic blueprint is a dream that
offers manifold returns in terms of understanding
how organisms develop and function in an
often hostile environment. With the rapid 
advances in molecular biology over the last 30 
years, the dream has come a step closer to reali- 
ty. Molecular biologists now have the ability to 
elucidate the composition of any genome. 
Indeed, almost 20 genomes have already been 
sequenced and more than 60 are currently
under way. Foremost among these is the
Human Genome Mapping Project. However,
the genomes of a number of commonly used
laboratory species are also under intensive
investigation, including yeast, Arabidopsis,
maize, rice, zebra fish, mouse, rat, and dog. It
is widely expected that the completion of such
programs will facilitate the development of
many powerful new techniques and approaches
to diagnosing and treating genetically and
environmentally induced diseases which afflict
mankind. However, the vast amount of data
being generated by genome mapping will
require new high-throughput technologies to
investigate the function of the millions of new
genes that are being reported. Among the most
widely heralded of the new functional
genomics technologies are DNA arrays, which
represent perhaps the most anticipated new
molecular biology technique since polymerase
chain reaction (PCR).

Arrays enable the study of literally thousands
of genes in a single experiment. The
potential importance of arrays is enormous and
has been highlighted by the recent publication
of an entire Nature Genetics supplement dedicated
to the technology (1). Despite this huge
surge of interest, DNA arrays are still little used
and largely unproven, as demonstrated by the
high ratio of review and press articles to actual
data papers. Even so, the potential they offer
has driven venture capitalists into a frenzy of
investment and many new companies are
springing up to claim a share of this rapidly
developing market.

The U.S. Environmental Protection
Agency (EPA) is interested in applying DNA
array technology to ongoing toxicologic studies.
To learn more about the current state of
the technology, the Reproductive Toxicology
Division (RTD) of the National Health and
Environmental Effects Research Laboratory
(NHEERL; Research Triangle Park, NC)
hosted a workshop on "Application of
Microarrays to Toxicology" on 7-8 January
1999 in Research Triangle Park, North
Carolina. The workshop was organized by
David Dix, Robert Kavlock, and John Rockett
of the RTD/NHEERL. Twenty-two intramural
and extramural scientists from government,
academia, and industry shared information,
data, and opinions on the current and
future applications for this exciting new technology.
The workshop had more than 150
attendees, including researchers, students, and
administrators from the EPA, the National
Institute of Environmental Health Sciences
(NIEHS), and a number of other establishments
from Research Triangle Park and
beyond. Presentations ranged from the technology
behind array production through the
sharing of actual experimental data and projections
on the future importance and applications
of arrays. The information contained in
the workshop presentations should provide aid
and insight into arrays in general and their
application to toxicology in particular.

Array Elements 

In the context of molecular biology, the word
"array" is normally used to refer to a series of
DNA or protein elements firmly attached in
a regular pattern to some kind of supportive
medium. DNA array is often used interchangeably
with gene array or microarray.
Although not formally defined, microarray is
generally used to describe the higher density
arrays typically printed on glass chips. The
DNA elements that make up DNA arrays
can be oligonucleotides, partial gene
sequences, or full-length cDNAs. Companies
offering pre-made arrays that contain less
than full-length clones normally use regions
of the genes which are specific to that gene to
prevent false positives arising through cross-hybridization.
Sequence verification of
cDNA clone identity is necessary because of
errors in identifying specific clones from
cDNA libraries and databases. Premade
DNA arrays printed on membranes are currently
or imminently available for human,
mouse, and rat. In most cases they contain
DNA sequences representing several thousand
different sequence clusters or genes as
delineated through the National Center for
Biotechnology Information UniGene Project
(3). Many of these different UniGene clusters
(putative genes) are represented only by
expressed sequence tags (ESTs).

Array Printing 

Arrays are typically printed on one of two
types of support matrix. Nylon membranes
are used by most off-the-shelf array providers
such as Clontech Laboratories, Inc.
(Palo Alto, CA), Genome Systems, Inc. (St.
Louis, MO), and Research Genetics, Inc.
(Huntsville, AL). Microarrays such as those
produced by Affymetrix, Inc. (Santa Clara,
CA), Incyte Pharmaceuticals, Inc. (Palo Alto,
CA), and many do-it-yourself (DIY) arraying
groups use glass wafers or slides. Although
standard microscope slides may be used, they
must be pre-prepared to facilitate sticking
of the DNA to the glass. Several different
coatings have been successfully used, including
silane and lysine. The coating of slides
can easily be carried out in the laboratory,
but many prefer the convenience of precoated
slides available from suppliers.

Once the support matrix has been prepared, the DNA elements can be applied
by several methods. Affymetrix, Inc. has developed a unique
photolithographic technology for attaching oligonucleotides to glass
wafers. More commonly, DNA is applied by either noncontact or contact
printing. Noncontact printers can use thermal, solenoid, or piezoelectric
technology to spray aliquots of solution onto the support matrix and may be
used to produce slide- or membrane-based arrays. Cartesian Technologies,
Inc. (Irvine, CA) has developed nQUAD technology for use in its PixSys
printers. The system couples a syringe pump with the microsolenoid valve, a
combination that provides rapid quantitative dispensing of nanoliter
volumes (down to 4.2 nL) over a variable volume range. A different approach
to noncontact printing uses a solid pin and ring combination (Genetic
MicroSystems, Inc., Woburn, MA). This system (Figure 1) allows a broader
range of sample, including cell suspensions and particulates, because the
printing head cannot be blocked up in the same way as a spray nozzle. Fluid
transfer is controlled in this system primarily by the pin dimensions and
the force of deposition, although the nature of the support matrix and the
sample will also affect transfer to some degree.

In contact printing, the pin head is dipped in the sample and then touched
to the support matrix to deposit a small aliquot. Split pins were one of
the first contact-printing devices to be reported and are the suggested
format for DIY arrayers, as described by Brown (J). Split pins are small
metal pins with a precise groove cut vertically in the middle of the pin
tip. In this system, 1-48 split pins are positioned in the pin-head. The
split pins work by simple capillary action, not unlike a fountain pen: when
the pin heads are dipped in the sample, liquid is drawn into the pin
groove. A small (fixed) volume is then deposited each time the split pins
are gently touched to the support matrix. Sample (100-500 pL depending on a
variety of parameters) can be deposited on multiple slides before refilling
is required, and array densities of > 2,500 spots/cm^2 may be produced. The
deposit volume depends on the split size, sample fluidity, and the speed of
printing. Split pins are relatively simple to produce and can be made
in-house if a suitable machine shop is available. Alternatively, they can
be obtained directly from companies such as TeleChem International, Inc.
(Sunnyvale, CA).

Irrespective of their source, printers should be run through a preprint
sequence prior to producing the actual experimental arrays: the first 100
or so spots of a new run tend to be somewhat variable. Factors affecting
spot reproducibility include slide treatment homogeneity, sample
differences, and instrument errors. Other factors that come into play
include clean ejection of the drop and clogging (nQUAD printing) and
mechanical variations and long-term alteration in print-head surface of
solid and split pins. However, with careful preparation it is possible to
get a coefficient of variation for spot reproducibility below 10%.
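
As an illustration of the 10% reproducibility figure, the following minimal
sketch computes a percent coefficient of variation for replicate spots of
one clone. The function name and the intensity values are hypothetical and
are not taken from the workshop; only the 10% cutoff comes from the text.

```python
from statistics import mean, stdev


def coefficient_of_variation(intensities):
    """Return the percent CV: sample standard deviation / mean * 100."""
    if len(intensities) < 2:
        raise ValueError("need at least two replicate spots")
    m = mean(intensities)
    if m == 0:
        raise ValueError("mean intensity is zero; CV is undefined")
    return stdev(intensities) / m * 100.0


# Hypothetical replicate spot intensities (arbitrary units) for one clone.
replicate_spots = [980.0, 1010.0, 1005.0, 995.0, 1020.0, 990.0]

cv_percent = coefficient_of_variation(replicate_spots)
passes = cv_percent < 10.0  # the reproducibility target quoted in the text
print(f"CV = {cv_percent:.2f}%  within target: {passes}")
```

In practice the replicates would be taken after the preprint sequence, since
the first spots of a run are described as more variable.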

One potential printing problem is sample 
carryover. Repeated washing, blotting, and 
drying (vacuum) of print pins between samples 
is normally effective at reducing sample carry- 
over to negligible amounts. Printing should 
also be carried out in a controlled environ- 
ment. Humidified chambers are available in 
which to place printers. These help prevent 
dust contamination and produce a uniform 
drying rate, which is important in determining 
spot size, quality, and reproducibility.

In summary, although several printing 
technologies are available, none are par- 
ticularly outstanding and the bottom line 
is that they are still in a relatively early stage 
of evolution. 

Array Hybridization 

The hybridization protocol is, practically speaking, relatively
straightforward and those with previous experience in blotting should have
little difficulty. Array hybridizations are, in essence, reverse
Southern/Northern blots: instead of applying a labeled probe to the target
population of DNA/RNA, the labeled population is applied to the probe(s).
With membrane-based arrays, the control and treated mRNA populations are
normally converted to cDNA and labeled with isotope (e.g., a phosphorus
radioisotope) in the process. These labeled populations are then hybridized
independently to parallel or serial arrays and the hybridization signal is
detected with a phosphorimager. A less commonly used alternative to
radioactive probes is enzymatic detection. The probe may be biotinylated,
haptenylated, or have alkaline phosphatase/horseradish peroxidase attached.
Hybridization is detected by enzymatic reaction yielding a color reaction
(4). Differences in hybridization signals can be detected by eye or, more
accurately, with the help of digital imaging and commercially available
software. The labeling of the test populations for slide-based microarrays
uses a slightly different approach. The probe typically consists of two
samples of polyA+ RNA (usually from a treated and a control population)
that are converted to cDNA; in the process each is labeled with a different
fluor. The independently labeled probes are then mixed together and
hybridized to a single microarray slide and the resulting combined
fluorescent signal is scanned. After normalization, it is possible to
determine the ratio of fluorescent signals from a single hybridization of a
slide-based microarray.

Figure 1. Genetic MicroSystems (Woburn, MA) pin ring system for printing
arrays. The pin ring combination consists of a circular open ring oriented
parallel to the sample solution, with a vertical pin centered over the
ring. When the ring is dipped into a solution and lifted, it withdraws an
aliquot of sample held by surface tension. To spot the sample, the pin is
driven down through the ring and a portion of the solution is transferred
to the bottom of the pin. The pin continues to move downward until the
pendant drop of solution makes contact with the underlying surface. The pin
is then lifted, and gravity and surface tension cause deposition of the
spot onto the array. Figure from Flowers et al. (14), with permission from
Genetic MicroSystems.
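
The ratio determination for two-color slide hybridizations can be sketched
as follows. This is only one possible normalization, assumed here for
illustration: each spot's log2(treated/control) ratio is centered on the
array median, which presumes that most arrayed genes do not change. The
function name and intensity values are hypothetical.

```python
from math import log2
from statistics import median


def normalized_log_ratios(treated, control):
    """Return log2(treated/control) per spot, centered on the array median."""
    if len(treated) != len(control):
        raise ValueError("channels must have the same number of spots")
    raw = [log2(t / c) for t, c in zip(treated, control)]
    center = median(raw)  # assumption: the typical gene is unchanged
    return [r - center for r in raw]


# Hypothetical background-subtracted intensities for five spots.
# The treated channel is uniformly twice as bright (a labeling bias),
# and the last spot is additionally induced.
treated_channel = [400.0, 800.0, 1600.0, 200.0, 3200.0]
control_channel = [200.0, 400.0, 800.0, 100.0, 400.0]

ratios = normalized_log_ratios(treated_channel, control_channel)
print(ratios)
```

After centering, the uniform two-fold brightness difference between the
channels disappears and only the last spot retains a nonzero log ratio.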
cDNA derived from control and treated populations of RNA is most commonly
hybridized to arrays, although subtractive hybridization or differential
display reactions may also be used. Fluorophore- or radiolabeled
nucleotides are directly incorporated into the cDNA in the process of
converting RNA to cDNA. Alternatively, 5' end-labeled primers may be used
for cDNA synthesis. These are labeled with a fluorophore for direct
visualization of the hybridized array. Alternatively, biotin or a hapten
may be attached to the primer, in which case fluor-labeled streptavidin or
antibody must be applied before a signal can be generated. The most
commonly used fluorophores at present are cyanine (Cy)3 and Cy5 (Amersham
Pharmacia Biotech AB, Uppsala, Sweden). However, the relative expense of
these fluorescent conjugates has driven a search for cheaper alternatives.
Fluorescein, rhodamine, and Texas red have all been used, and companies
such as Molecular Probes, Inc. (Eugene, OR) are developing a series of
labeled nucleotides with a wide range of excitation and emission spectra
which may prove to function as well as the Cy dyes.

Analysis of DNA Microarrays

Table 1. Advantages and disadvantages of different microarray scanning
systems.



Membrane-based arrays are normally analyzed on film or with a
phosphorimager, whereas chip-based arrays require more specialized scanning
devices. These can be divided into three main groups: the charge-coupled
device camera systems, the nonconfocal laser scanners, and the confocal
laser scanners. The advantages and disadvantages of each system are listed
in Table 1.

Because a typical spot on a microarray can contain > 10^6 molecules, it is
clear that a large variation in signal strength may occur. Current scanners
cannot work across this many orders of magnitude (4 or 5 is more typical).
However, the scanning parameters can normally be adjusted to collect more
or less signal, such that two or three scans of the same array should
permit the detection of rare and abundant genes.

When a microarray is scanned, the fluores- 
cent images are captured by software normally 
included with the scanner. Several commercial 
suppliers provide additional software for quan- 
tifying array images, but the software tools are 
constantly evolving to meet the developing 
needs of researchers, and it is prudent to 
define one's own needs and clarify the exact
capabilities of the software before its purchase. 
Issues that should be considered include the 
following: 

• Can the software locate offset spots? 

• Can it quantitate across irregular hybridiza- 
tion signals? 

• Can the arrayed genes be programmed in for 
easy identification and location? 

• Can the software connect via the Internet to 
databases containing further information on 
the gene(s) of interest?

One of the key issues raised at the workshop was the sensitivity of
microarray technology. Experiments by General Scanning, Inc. (Watertown,
MA), have shown that by using the Cy dyes and their scanner, signal can be
detected down to levels of < 1 fluor molecule per square micrometer, which
translates to detecting a rare message at approximately one copy per cell
or less.

Array Applications 

Although arrays are an emerging technology certain to undergo improvement
and alteration, they have already been applied usefully to a number of
model systems. Arrays are at their most powerful when they contain the
entire genome of the species they are being used to study. For this reason,
they have strong support among researchers utilizing yeast and
Caenorhabditis elegans (5). The genomes of both of these species have been
sequenced and, in the case of yeast, deposited onto arrays for examination
of gene expression (6,7). With both of these species, it is relatively easy
to perturb individual gene expression. Indeed, C. elegans knockouts can be
made simply by soaking the worms in an antisense solution of the gene to be
knocked out.

By a process of systematic gene disruption, it is now possible to examine
the cause and effect relationships between different genes in these simple
organisms. This kind of approach should help elucidate biochemical pathways
and genetic control processes, deconvolute polygenic interactions, and
define the architecture of the cellular network. A simple case study of how
this can be achieved was presented by Butow [University of Texas
Southwestern Medical Center, Dallas, TX (Figure 2)]. Although it is the
phenotypic result of a single gene knockout that is being examined, the
effect of such perturbation will almost always be polygenic. Polygenic
interactions will become increasingly important as researchers begin to
move away from single gene systems when examining the nature of toxicologic
responses to external stimuli. This is especially important in toxicology
because the phenotype produced by a given environmental insult is never the
result of the action of a single gene; rather, it is a complex interaction
of one or multiple cellular pathways. Phenomena such as quantitative trait
(the continuous variation of phenotype), epistasis (the effect of alleles
of one or more genes on the expression of other genes), and penetrance
(proportion of individuals of a given genotype that display a particular
phenotype) will become increasingly evident and important as toxicologists
push toward the ultimate goal of matching the responses of individuals to
different environmental stimuli.

Analysis of the transcriptome (the expression level of all the genes in a
given cell population) was a use of arrays addressed by several speakers.
Unfortunately, current gene nomenclature is often confusing in that single
genes are allocated multiple names (usually as a result of independent
discovery by different laboratories), and there was a call for
standardization of gene nomenclature. Nevertheless, once a transcriptome
has been assembled it can then be transferred onto arrays and used to
screen any chosen system. The EPA MicroArray Consortium (EPAMAC) is
assembling testes transcriptomes for human, rat, and mouse. In a slightly
different approach, Nuwaysir et al. (8) describe how the NIEHS assembled
what is effectively a "toxicological transcriptome", a library of human and
mouse genes that have previously been proven or implicated in responses to
toxicologic insults. Clontech Laboratories, Inc. (Palo Alto, CA), has begun
a similar process by developing stress/toxicology filter arrays of rat,
mouse, and human genes. Thus, rather than being tissue or cell specific,
these stress/toxicology arrays can be used across a variety of model
systems to look for alterations in the expression of toxicologically
important genes and define the new field of toxicogenomics. The potential
to identify toxicant families based on tissue- or cell-specific gene
expression could revolutionize drug testing. These molecular signatures or
fingerprints could not only point to the possible toxicity/carcinogenicity
of newly discovered compounds (Figure 3), but also aid in elucidating their
mechanism of action through identification of gene expression networks. By
extension, such signatures could provide easily identifiable biomarkers to
assess the degree, time, and nature of exposure.

DNA arrays are primarily a tool for examining differential gene expression
in a given model. In this context they are referred to as closed systems
because they lack the ability of other differential expression
technologies, e.g., differential display and subtractive hybridization, to
detect previously unknown genes not present on the array. This would appear
to limit the power of DNA arrays to the imaginations and preconceptions of
the researcher in selecting genes previously characterized and thought to
be involved in the model system. However, the various genome sequencing
projects have created a new category of sequence, the EST, that has
partially mollified this deficiency. ESTs are cDNAs expressed in a given
tissue that, although they may share some degree of sequence similarity to
previously characterized genes, have not been assigned specific genetic
identity. By incorporating EST clones into an array, it is possible to
monitor the expression of these unknown genes. This can enable the
identification of previously uncharacterized genes that may have biologic
significance in the model system. Filter arrays from Research Genetics and
slide arrays from Incyte Pharmaceuticals both incorporate large numbers of
ESTs from a variety of species.

A further use of microarrays is the identification of single nucleotide
polymorphisms (SNPs). These genomic variations are abundant (they occur
approximately every 1 kb or so) and are the basis of restriction fragment
length polymorphism analysis used in forensic analysis. Affymetrix, Inc.,
designed chips that contain multiple repeats of the same gene sequence.
Each position is present with all four possible bases. After the
hybridization of the sample, the degree of hybridization to the different
sequences can be measured and the exact sequence of the target gene
deduced. SNPs are thought to be of vital importance in drug metabolism and
toxicology. For example, single base differences in the regulatory region
or active site of some genes can account for huge differences in the
activity of that gene. Such SNPs are thought to explain why some people are
able to metabolize certain xenobiotics better than others. Thus, arrays
provide a further tool for the toxicologist to investigate the nature of
susceptible subpopulations and toxicologic response.
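
The idea of deducing a sequence from probes that present all four bases at
each position can be sketched as below. This is a simplified illustration,
not the vendor's algorithm: the rule used here (call the brightest base,
and report "N" when it is not at least twice the runner-up) and all signal
values are assumptions made for the example.

```python
BASES = "ACGT"


def call_sequence(intensities, min_ratio=2.0):
    """Call one base per position from four probe intensities.

    intensities: list of dicts mapping each base to its hybridization signal.
    A position is reported as 'N' when the strongest signal is less than
    min_ratio times the second strongest (an assumed ambiguity rule).
    """
    calls = []
    for position in intensities:
        ranked = sorted(BASES, key=lambda b: position[b], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        second = position[runner_up]
        if second > 0 and position[best] / second < min_ratio:
            calls.append("N")
        else:
            calls.append(best)
    return "".join(calls)


# Hypothetical signals for four positions of a target gene.
probe_signals = [
    {"A": 900, "C": 50, "G": 40, "T": 60},
    {"A": 30, "C": 45, "G": 850, "T": 20},
    {"A": 400, "C": 35, "G": 380, "T": 25},  # two similar signals
    {"A": 20, "C": 30, "G": 25, "T": 700},
]

called = call_sequence(probe_signals)
print(called)
```

A position with two comparably strong signals, as in the third entry, is the
kind of result that might indicate a heterozygous site or cross-hybridization
and would need follow-up.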

Figure 2. Potential effects of gene knockout within positively and
negatively regulated gene expression networks. The example is a simple,
two-component linear regulatory network operating on a target gene, in
which one component is a positive effector and the other is either a
positive or negative effector. This network could be deduced by examining
the consequences of deleting a component on expression of the others:
expression would be decreased or increased depending on whether the deleted
component was a positive or negative regulator. These and other connected
components of even greater complexity could be revealed by genome-wide
expression analysis. From Butow (15).

There are still many wrinkles to be ironed out before arrays become a
standard tool for toxicologists. The main issues raised at the workshop by
those with hands-on experience were the following:

• Expense: the cost of purchasing/contracting this technology is still too
great for many individual laboratories.

• Clones: the logistics of identifying, obtaining, and maintaining a set of
nonredundant, noncontaminated, sequence-verified,
species/cell/tissue/field-specific clones.

• Use of inbred strains: where whole-organism models are being used, the
use of inbred strains is important to reduce the potentially confusing
effects of the individual variation typically seen in outbred populations.

• Probe: the need for relatively large amounts of RNA, which limits the
type of sample (e.g., biopsy) that can be used. Also, different RNA
extraction methods can give different results.

• Specificity: the ability to discriminate accurately between closely
related genes (e.g., the cytochrome P450 family) and splice variants.

• Quantitation: the quantitation of gene expression using gene arrays is
still open to debate. One reason for this is the different incorporation of
the labeling dyes. However, the main difficulty lies in knowing what to
normalize against. One option is to include a large number of so-called
housekeeping genes in the array. However, the expression of these genes
often changes depending on the tissue and the toxicant, so it is necessary
to characterize the expression of these genes in the model system before
utilizing them. This is clearly not a viable option when screening multiple
new compounds. A second option is to include on the array genes from a
nonrelated species (e.g., a plant gene on an animal array) and to spike the
probe with synthetic RNA(s) complementary to the gene(s).

• Reproducibility: this is sometimes questionable, and a figure of
approximately two or three repeats was used as the minimum number required
to confirm initial findings. Again, however, most people would use Northern
blots or reverse transcriptase PCR to confirm findings.

• Sensitivity: concerns were voiced about the number of target molecules
that must be present in a sample for them to be detected on the array.

• Efficiency: reproducible identification of 1.5- to 2-fold differences in
expression was reported, although the number of genes that undergo this
level of change and remain undetected is open to debate. It is important
that this level of detection be ultimately achieved because it is commonly
perceived that some important transcription factors and their regulators
respond at such low levels. In most cases, 5- to Void was the minimum
change that most were happy to accept.

• Bioinformatics: perhaps the greatest concern was how to interpret the
data with the greatest accuracy and efficiency. The biggest headache is
trying to identify networks of gene expression that are common to different
treatments or doses. The amount of data from a single experiment is huge.
It may be that, in the future, several groups individually equipped with
specialized software algorithms for studying their favorite genes or gene
systems will be able to share the same hybridized chips. Thus, arrays could
usher in a new perspective on collaboration and the sharing of data.
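
The second normalization option raised under Quantitation (spiking the
probe with synthetic RNA complementary to nonrelated-species genes on the
array) can be sketched as follows. This is a minimal illustration under
stated assumptions: the function names, gene names, and intensities are
hypothetical, and the 2-fold cutoff is an illustrative choice rather than a
recommendation from the workshop.

```python
def spike_in_scale(treated_spikes, control_spikes):
    """Scale factor that equalizes mean spike-in signal on the two arrays."""
    control_mean = sum(control_spikes) / len(control_spikes)
    treated_mean = sum(treated_spikes) / len(treated_spikes)
    return control_mean / treated_mean


def changed_genes(treated, control, scale, min_fold=2.0):
    """Return {gene: fold} for genes changed by at least min_fold, up or down."""
    hits = {}
    for gene, signal in treated.items():
        fold = (signal * scale) / control[gene]
        if fold >= min_fold or fold <= 1.0 / min_fold:
            hits[gene] = fold
    return hits


# Hypothetical signals. Equal amounts of spike RNA were added to both
# probes, so the spike spots should read the same on both arrays.
control_spikes = [1000.0, 1000.0]
treated_spikes = [500.0, 500.0]  # treated array labeled half as efficiently

control = {"geneA": 300.0, "geneB": 800.0, "geneC": 1200.0}
treated = {"geneA": 150.0, "geneB": 1600.0, "geneC": 150.0}

scale = spike_in_scale(treated_spikes, control_spikes)
hits = changed_genes(treated, control, scale)
print(scale, hits)
```

Without the spike-in correction geneA would appear 2-fold repressed; after
scaling it is unchanged, while geneB and geneC remain flagged.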

EPAMAC 

Perhaps the main reason most scientists are unable to use array technology
is the high cost involved, whether buying off-the-shelf membranes, using
contract printing services, or producing chips in-house. In view of this,
researchers at the RTD/NHEERL initiated the EPAMAC. This consortium brings
together scientists from the EPA and a number of extramural labs with the
aim of developing microarray capability through the sharing of resources
and data. EPAMAC researchers are primarily interested in the developmental
and toxicologic changes seen in testicular and breast tissue, and a portion
of the workshop was set aside for EPAMAC members to share their ideas on
how the experimental application of microarrays could facilitate their
research. One of the central areas of interest to EPAMAC members is the
effect of xenobiotics on male fertility and reproductive health. Of
greatest concern is the effect of exposure during critical periods of
development and germ cell differentiation (9), and how this may compromise
sperm counts and quality following sexual maturation (10). As well as
spermatogenic tissue, there is also interest in how residual mRNA found in
mature sperm (11) could be used as an indicator of previous xenobiotic
effects (it is easier to obtain a semen sample than a testicular biopsy).
Arrays will be used to examine and compare the effect of exposure to heat
and chemicals in testicular and epididymal gene expression profiles, with
the aim of establishing relationships/associations between changes in
developmental landmarks and the effects on sperm count and quality.
Cluster, pattern, and other analysis of such data should help identify
hidden relationships between genes that may reveal potential mechanisms of
action and uncover roles for genes with unknown functions.
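
As a hedged illustration of the kind of pattern analysis mentioned above,
the sketch below pairs genes whose expression profiles rise and fall
together across treatments, using Pearson correlation. The report does not
specify any particular method; the 0.9 threshold, the gene names, and the
profile values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlated_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above threshold."""
    genes = sorted(profiles)
    pairs = []
    for i, first in enumerate(genes):
        for second in genes[i + 1:]:
            if pearson(profiles[first], profiles[second]) >= threshold:
                pairs.append((first, second))
    return pairs


# Hypothetical log ratios across four doses of a treatment.
profiles = {
    "gene1": [0.0, 1.0, 2.0, 3.0],
    "gene2": [0.1, 1.9, 4.1, 6.0],  # tracks gene1 at about twice the size
    "gene3": [3.0, 2.0, 1.0, 0.0],  # moves opposite to gene1
}

pairs = correlated_pairs(profiles)
print(pairs)
```

Genes grouped this way share a pattern, not necessarily a mechanism; a gene
of unknown function that tracks a characterized gene is only a candidate for
further study.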

Summary 

The full impact of DNA arrays may not be seen for several years, but the
interest shown at this regional workshop indicates the high level of
interest that they foster. Apart from educating and advertising the various
technologies in this field, this workshop brought together a number of
researchers from the Research Triangle Park area who are already using DNA
arrays. The interest in sharing ideas and experiences led to the initiation
of a Triangle array user's group.

Array technology is still in its infancy. This means that the hardware is
still improving and there is no current consensus for standard procedures,
quantitation, and interpretation. Consistency in spotting and scanning
arrays is not yet optimized, and this is one of the most critical
requirements of any experiment. In addition, one of the dark regions of
array technology, strife in the courts over who owns what portions of it,
has further muddled the future and is a potential barrier toward the
development of consensus procedures.

Perhaps the greatest hurdle for the application of arrays is the actual
interpretation of data. No specialists in bioinformatics attended the
workshop, largely because they are rare and because as yet no one seems
clear on the best method of approaching data analysis and interpretation.
Cross-referencing results from multiple experiments (time, dose, repeats,
different animals, different species) to identify commonly expressed genes
is a great challenge. In most cases, we are still a long way from
understanding how the expression of gene X is related to the expression of
gene Y and ordering gene expression to delineate causal relationships.

To the ordinary scientist in the typical laboratory, however, the most
immediate problem is a lack of affordable instrumentation. One can purchase
premade membranes at relatively affordable prices. Although these may be
useful in identifying individual genes to pursue in more detail using other
methods, the numbers that would be required for even a small routine
toxicology experiment prohibit this as a truly viable approach. For the
toxicologist, there is a need to carry out multiple experiments: dose
responses, time curves, multiple animals, and repeats. Glass-based DNA
arrays are most attractive in this context because they can be prepared in
large batches from the same DNA source and accommodate control and treated
samples on the same chip. Another problem with current off-the-shelf arrays
is that they often do not contain one or more of the particular genes a
group is interested in. One alternative is to obtain and/or produce a set
of custom clones and have contract printing of membranes or slides carried
out by a company such as Genomic Solutions, Inc. (Ann Arbor, MI). This
approach is less expensive than buying one's own entire system, although at
some point it might make economic sense to print one's own arrays.

Finally, DNA arrays are currently a team effort. They are a technology that
uses a wide range of skills including engineering, statistics, molecular
biology, chemistry, and bioinformatics. Because most individuals are
skilled in only one or perhaps two of these areas, it appears that success
with arrays may be best expected by teams of collaborators consisting of
individuals having each of these skills.

Those considering array applications may be amused or goaded on by the
following quote from Fortune magazine (/ J):

Microprocessors have reshaped our economy, spawned vast fortunes and
changed the way we live. Gene chips could be even bigger.

Although this comment may have been designed to excite the imagination
rather than accurately reflect the truth, it is fair to say that the age of
functional genomics is upon us. DNA arrays look set to be an important tool
in this new age of biotechnology and will likely contribute answers to some
of toxicology's most fundamental questions.

USSN: 09/991,212

Subject: RE: [Fwd: Toxicology Chip]
Date: Mon, 3 Jul 2000 08:09:45 -0400
From: "Afshari, Cynthia" <afshari@niehs.nih.gov>
To: "'Diana Hamlet-Cox'" <dianahc@incyte.com>

You can see the list of clones that we have on our 12K chip at
[illegible URL].

We selected a subset of genes (2000K) that we believed [illegible]
response and basic cellular processes and added a set of [illegible].
We have included a set of control genes (80+) that were selected by
[illegible] because they did not change across a large set of array
experiments. However, we have found that some of these genes change
significantly after tox treatments and are in the process of looking at
variation of each of these 80+ genes across our experiments.
Our chips are constantly changing and being updated and we hope that our
data will lead us to what the toxchip should really be.
I hope this answers your question.
Cindy Afshari

> From: Diana Hamlet-Cox
> Sent: Monday, June 26, 2000 8:52 PM
> To: afshari@niehs.nih.gov
> Subject: [Fwd: Toxicology Chip]
>
> Dear Dr. Afshari,
>
> Since I have not yet had a response from Bill Grigg, perhaps he was not
> the right person to contact.
>
> Can you help me in this matter? I don't need to know the sequences
> necessarily, but I would like very much to know what types of sequences
> are being used, e.g., GPCRs (more specific?), ion channels, [illegible]
>
> Diana Hamlet-Cox
>
> -------- Original Message --------
> Subject: Toxicology Chip
> Date: Mon, 19 Jun 2000 18:31:48 -0700
> From: Diana Hamlet-Cox <dianahc@incyte.com>
> Organization: Incyte Pharmaceuticals
> To: grigg@niehs.nih.gov
>
> Dear Colleague:
>
> I am doing literature research on the use of expressed genes as
> pharmacotoxicology markers, and found the Press Release dated February
> [illegible] know if there is a resource I can access (or you could
> provide?) that [illegible] list of the 12,000 genes that are on your
> Human ToxChip Microarray. In particular, I am interested in the criteria
> used to select sequences for the ToxChip, including any control sequences
> included in the microarray.
>
> Thank you for your assistance in this request.
>
> Diana Hamlet-Cox, Ph.D.
> Incyte Genomics, Inc.
>
> This email message is for the sole use of the intended recipient(s) and
> may contain confidential and privileged information subject to
> attorney-client privilege. Any unauthorized review, use, disclosure or
> distribution is prohibited. If you are not the intended recipient,
> please contact the sender by reply email and destroy all copies of the
> original message.

07/31/2000 10:34 AM

Docket No.: PF-0221-3DIV
USSN: 09/991,212



Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 6073-6078, May 1998
Biochemistry

Assessing sequence comparison methods with reliable structurally
identified distant evolutionary relationships

STEVEN E. BRENNER, CYRUS CHOTHIA, AND TIM J. P. HUBBARD



ABSTRACT Pairwise sequence comparison methods have been assessed using
proteins whose relationships are known reliably from their structures and
functions, as described in the SCOP database [Murzin, A. G., Brenner, S.
E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The
evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller,
W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403-410],
WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266,
460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad.
Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981)
J. Mol. Biol. 147, 195-197] and their scoring schemes. The error rate of
all algorithms is greatly reduced by using statistical scores to evaluate
matches rather than percentage identity or raw scores. The E-value
statistical scores of SSEARCH and FASTA are reliable: the number of false
positives found in our tests agrees well with the scores reported. However,
the P-values reported by BLAST and WU-BLAST2 exaggerate significance by
orders of magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform best,
and they are capable of detecting almost all relationships between proteins
whose sequence identities are >30%. For more distantly related proteins,
they do much less well; only one-half of the relationships between proteins
with 20-30% identity are found. Because many homologs have low sequence
similarity, most distant relationships cannot be detected by any pairwise
comparison method; however, those which are identified may be used with
confidence.

Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreUne the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative
capabilities of different procedures are largely unknown. It is
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary rela- 
tionships are known unambiguously and independently of the 
methods being evaluated. However, nearly all known
homologs have been identified by sequence analysis (the method
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 

The publication costs of this article were defrayed in part by page charge
payment. This article must therefore be hereby marked "advertisement" in
accordance with 18 U.S.C. §1734 solely to indicate this fact.
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Sequence comparison methodologies have evolved rapidly 
so no previously published test has evaluated the modern versions
of the programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version of
FASTA (version 3.0) provides fundamentally different results in the
form of statistical scoring.
The previous reports also have left gaps in our knowledge.
For example, there has been no published assessment of
thresholds for scoring schemes more sophisticated than percentage
identity. Thus, the widely discussed statistical scoring
measures have never actually been evaluated on large databases
of real proteins. Moreover, the different scoring schemes
commonly in use have not been compared.

Beyond these issues, there is a more fundamental question:
in an absolute sense, how well does pairwise sequence comparison
work? That is, what fraction of homologous proteins
can be detected using modern database searching methods?

In this work, we attempt to answer these questions and to
overcome both of the fundamental difficulties that have hindered
assessment of sequence comparison methodologies.
First, we use the set of distant evolutionary relationships in the
SCOP: Structural Classification of Proteins database (4), which
is derived from structural and functional characteristics (5).
The SCOP database provides a uniquely reliable set of homologs,
which are known independently of sequence comparison.
Second, we use an assessment method that jointly measures
both sensitivity and specificity. This method allows
straightforward comparison of different sequence searching
procedures. Further, it can be used to aid interpretation of real
database searches and thus provide optimal and reliable
results.

Previous Assessments of Sequence Comparison. Several 
previous studies have examined the relative performance of 
different sequence comparison methods. The most encompassing
analyses have been by Pearson (6, 7), who compared
the three most commonly used programs. Of these, the
Smith-Waterman algorithm (8) implemented in SSEARCH (3) is the
oldest and slowest but the most rigorous. Modern heuristics
have provided BLAST (1) the speed and convenience to make
it the most popular program. Intermediate between these two
is FASTA (3), which may be run in two modes offering either
greater speed (ktup = 2) or greater effectiveness (ktup = 1).
Pearson also considered different parameters for each of these 
programs. 

To test the methods, Pearson selected two representative
proteins from each of 67 protein superfamilies defined by the 
PIR database (9). Each was used as a query to search the 
database, and the matched proteins were marked as being
homologous or unrelated according to their membership of pir 

Abbreviation: EPQ, errors per query.
superfamilies. Pearson found that modern matrices and "ln-
scaling" of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than fasta, which was in turn more effective 
than BLAST. 

Very large scale analyses of matrices have been performed 
(10), and Henikoff and Henikoff (11) also evaluated the 
effectiveness of BLAST and FASTA. Their test with BLAST
considered the ability to detect homologs above a predeter- 
mined score but had no penalty for methods which also 
reported large numbers of spurious matches. The Henikoffs 
searched the swiss-prot database (12) and used prosite (13) 
to define homologous families. Their results showed that the 
BLOSUM62 matrix (14) performed markedly better than the 
extrapolated PAM-series matrices (15), which previously had 
been popular. 

A crucial aspect of any assessment is the data that are used 
to test the ability of the program to find homologs. But in 
Pearson's and the Henikoffs' evaluations of sequence comparison,
the correct results were effectively unknown. This is
because the superfamilies in pir and prosite are principally 
created by using the same sequence comparison methods 
which are being evaluated. Interdependency of data and 
methods creates a "chicken and egg" problem, and means for 
example, that new methods would be penalized for correctly 
identifying homologs missed by older programs. For instance, 
immunoglobulin variable and constant domains are clearly 
homologous, but pir places them in different superfamilies. 
The problem is widespread: each superfamily in pir 48.00 with 
a structural homolog is itself homologous to an average of 1.6 
other pir superfamilies (16). 

To surmount these sorts of difficulties, Sander and Schneider
(17) used protein structures to evaluate sequence comparison.
Rather than comparing different sequence comparison
algorithms, their work focused on determining a length-dependent
threshold of percentage identity, above which all
proteins would be of similar structure. A result of this analysis
was the HSSP equation; it states that proteins with 25% identity
over 80 residues will have similar structures, whereas shorter
alignments require higher identity. (Other studies also have
used structures (18-20), but these focused on a small number
of model proteins and were principally oriented toward evaluating
alignment accuracy rather than homology detection.)
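
The length dependence described above can be made concrete with a short sketch. The constants below are an assumption: they are the form of the curve commonly quoted from Sander and Schneider (17), t(L) = 290.15 L^-0.562 for alignments of 10 to 80 residues and a flat 24.8% for longer ones, and they do not appear in the text itself. The function names and the `offset` argument are hypothetical.

```python
def hssp_threshold(length: int) -> float:
    """Percentage identity above which an alignment of `length` residues is
    taken to imply similar structure (assumed Sander-Schneider constants)."""
    if length < 10:
        raise ValueError("curve is not defined for alignments under 10 residues")
    if length > 80:
        return 24.8  # flat region for long alignments
    return 290.15 * length ** -0.562


def passes_hssp(identity_pct: float, length: int, offset: float = 0.0) -> bool:
    """True if the alignment lies on or above the curve shifted up by `offset`."""
    return identity_pct >= hssp_threshold(length) + offset
```

Under these assumed constants the curve falls to roughly 25% at 80 residues, in line with the statement above, and rises steeply for shorter alignments. A stricter cutoff can be expressed by passing a positive `offset`.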

A general solution to the problem of scoring comes from 
statistical measures (i.e., E-values and P-values) based on the 
extreme value distribution (21). Extreme value scoring was 
implemented analytically in the BLAST program using the 
Karlin and Altschul statistics (22, 23) and empirical ap- 
proaches have been recently added to fasta and SSEarch. In 
addition to being heralded as a reliable means of recognizing 
significantly similar proteins (24, 25), the mathematical trac- 
tability of statistical scores "is a crucial feature of the blast 
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in
ref. 24). However, all large empirical tests used random
sequences that may lack the subtle structure found within 
biological sequences (26, 27) and obviously do not contain any 
real homologs. Thus, although many researchers have suggested
that statistical scores be used to rank matches (24, 25,
28), there have been no large rigorous experiments on biological
data to determine the degree to which such rankings are
superior. 

A Database for Testing Homology Detection. Since the 
discovery that the structures of hemoglobin and myoglobin are 
very similar though their sequences are not (29), it has been 
apparent that comparing structures is a more powerful (if less 
convenient) way to recognize distant evolutionary relation- 
ships than comparing sequences. If two proteins show a high 
degree of similarity in their structural details and function, it 
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is very probable that they have an evolutionary relationship 
though their sequence similarity may be low. 

The recent growth of protein structure information com- 
bined with the comprehensive evolutionary classification in 
the scop database (4, 5) have allowed us to overcome previous
limitations. With these data, we can evaluate the performance 
of sequence comparison methods on real protein sequences 
whose relationships are known confidently. The scop database 
uses structural information to recognize distant homologs, the 
large majority of which can be determined unambiguously. 
These superfamilies, such as the globins or the immunoglobulins,
would be recognized as related by the vast majority of the
biological community despite the lack of high sequence sim- 
ilarity. 

From SCOP, we extracted the sequences of domains of
proteins in the Protein Data Bank (PDB) (30) and created two
databases. One (PDB90D-B) has domains that were all <90%
identical to any other, whereas the other (PDB40D-B) had those
<40% identical. The databases were created by first sorting all
protein domains in SCOP by their quality and making a list. The
highest quality domain was selected for inclusion in the
database and removed from the list. Also removed from the list
(and discarded) were all other domains above the threshold
level of identity to the selected domain. This process was
repeated until the list was empty. The PDB40D-B database
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or ~0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988 relationships,
representing 1.2% of all pairs. Low complexity regions
of sequence can achieve spurious high scores, so these were
masked in both databases by processing with the SEG program
(27) using recommended parameters: 12 1.8 2.0. The databases
used in this paper are available from http://sss.stanford.edu/sss/,
and databases derived from the current version of SCOP
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
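
The selection procedure described above is a greedy filter, and a minimal sketch may help fix the idea. The names `select_nonredundant`, `quality`, and `identity` are hypothetical; the paper does not say how quality or pairwise identity were computed, so both are passed in as caller-supplied functions.

```python
from typing import Callable, List, Sequence, TypeVar

D = TypeVar("D")


def select_nonredundant(
    domains: Sequence[D],
    quality: Callable[[D], float],
    identity: Callable[[D, D], float],
    max_identity: float,
) -> List[D]:
    """Greedy selection: keep the best remaining domain, then discard every
    other domain whose identity to it is at or above `max_identity`."""
    remaining = sorted(domains, key=quality, reverse=True)
    selected: List[D] = []
    while remaining:
        best = remaining.pop(0)          # highest quality domain left
        selected.append(best)
        remaining = [d for d in remaining if identity(best, d) < max_identity]
    return selected


def ordered_pairs(n_domains: int) -> int:
    """Number of ordered (query, target) pairs among n distinct domains."""
    return n_domains * (n_domains - 1)
```

With 1,323 domains, `ordered_pairs` gives 1,749,006, which matches the total quoted for PDB40D-B; 9,044 of those pairs is about 0.5%.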

Analyses from both databases were generally consistent but 
PDB40D-B focuses on distantly related proteins and reduces the 
heavy overrepresentation in the pdb of a small number of
families (31, 32), whereas PDB90D-B (with more sequences)
improves evaluations of statistics. Except where noted otherwise,
the distant homolog results here are from PDB40D-B.
Although the precise numbers reported here are specific to the 
structural domain databases used, we expect the trends to be 
general. 

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a 
new assessment criterion. 

The analyses tested BLAST (1), version 1.4.9MP, and WU-BLAST2
(2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix (BLOSUM62)
were used for BLAST and WU-BLAST2.

The "Coverage vs. Error" Plot. To test a particular protocol
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have 
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perfect separation, with all of the homologs at the top of the 
list and unrelated proteins below. In practice, perfect separation
is impossible to achieve, so instead one is interested in
drawing a threshold above which there are the largest number
of related pairs of sequences consistent with an acceptable
error rate.

Our procedure involved measuring the coverage and error
for every threshold. Coverage was defined as the fraction of
structurally determined homologs that have scores above the
selected threshold; this reflects the sensitivity of a method.
Errors per query (EPQ), an indicator of selectivity, is the
number of nonhomologous pairs above the threshold divided
by the number of queries. Graphs of these data, called
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of Receiver
Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of nonhomologs.
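
The two quantities defined above can be computed in a single pass over the ranked list. The following is a minimal sketch, not the authors' code: the function names are hypothetical, each pair is assumed to carry a precomputed `is_homolog` flag from the structural classification, and tied scores are not given special treatment.

```python
from typing import Iterable, List, Tuple


def coverage_vs_error(
    scored_pairs: Iterable[Tuple[float, bool]],
    n_homolog_pairs: int,
    n_queries: int,
    higher_is_better: bool = True,
) -> List[Tuple[float, float, float]]:
    """Return (threshold score, coverage, errors per query) for every cutoff.

    scored_pairs holds (score, is_homolog) for each reported query/target pair.
    Use higher_is_better=False for E-values or P-values.
    """
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=higher_is_better)
    points = []
    true_hits = false_hits = 0
    for score, is_homolog in ranked:
        if is_homolog:
            true_hits += 1
        else:
            false_hits += 1
        points.append((score, true_hits / n_homolog_pairs, false_hits / n_queries))
    return points


def coverage_at_epq(points: List[Tuple[float, float, float]], max_epq: float) -> float:
    """Largest coverage reached while errors per query stay at or below max_epq."""
    return max((cov for _, cov, epq in points if epq <= max_epq), default=0.0)
```

Calling `coverage_at_epq(points, 0.01)` corresponds to reading a plot at 1% EPQ, that is, one false pair per hundred queries.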

This assessment procedure is directly relevant to practical 
sequence database searching, for it provides precisely the 
information necessary to perform a reliable sequence database 
search. The EPQ measure places a premium on score consistency;
that is, it requires scores to be comparable for different
queries. Consistency is an aspect which has been largely
ignored in previous tests but is essential for the straightforward
or automatic interpretation of sequence comparison results.
Further, it provides a clear indication of the confidence that
should be ascribed to each match. Indeed, the EPQ measure
should approximate the expectation value reported by database
searching programs, if the programs' estimates are accurate.

Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB90D-B: Each pair of nonhomologous proteins found with
SSEARCH is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the HSSP threshold (though it is intended to be applied
with a different matrix and parameters).

Fig. 4. Reliability of statistical scores in PDB90D-B: Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers, and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed 
in several ways based on either the length of the alignment or 
the lengths of the sequences. The second is a "raw" or 
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing
the substitution matrix scores for each position in the align- 
ment and subtracting gap penalties. In blast, a measure 
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related to this score is scaled into bits. Third is a statistical 
score based on the extreme value distribution. These results 
are summarized in Fig. 1. 
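
The first two score types can be read directly off a finished alignment, as the sketch below illustrates. It makes several assumptions that are not in the text: the alignment is given as two gapped rows of equal length, percentage identity is taken over the aligned (non-gap) columns only, a gap of k residues costs `gap_open + (k - 1) * gap_extend`, and `toy_substitution` is a made-up stand-in for a real matrix such as BLOSUM45.

```python
from typing import Callable, Tuple


def score_alignment(
    row_a: str,
    row_b: str,
    substitution: Callable[[str, str], int],
    gap_open: int = 12,
    gap_extend: int = 1,
) -> Tuple[int, float]:
    """Return (raw score, percentage identity) for two gapped alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    raw = 0
    identical = 0
    aligned = 0
    open_gap = None  # which row currently holds an open gap, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            raw -= gap_extend if side == open_gap else gap_open
            open_gap = side
            continue
        open_gap = None
        aligned += 1
        if x == y:
            identical += 1
        raw += substitution(x, y)
    percent_identity = 100.0 * identical / aligned if aligned else 0.0
    return raw, percent_identity


def toy_substitution(x: str, y: str) -> int:
    """Hypothetical scoring: +5 for a match, -2 for any mismatch."""
    return 5 if x == y else -2
```

The statistical score is not computed here, because it also depends on sequence lengths, composition, and database size rather than on the alignment alone.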

Sequence Identity. Though it has been long established that
percentage identity is a poor measure (35), there is a common
rule-of-thumb stating that 30% identity signifies homology.
Moreover, publications have indicated that 25% identity can
be used as a threshold (17, 36). We find that these thresholds,
originally derived years ago, are not supported by present
results. As databases have grown, so have the possibilities for
chance alignments with high identity; thus, the reported cutoffs
lead to frequent errors. Fig. 2 shows one of the many pairs of
proteins with very different structures that nonetheless have
high levels of identity over considerable aligned regions.
Despite the high identity, the raw and the statistical scores for
such incorrect matches are typically not significant. The principal
reasons percentage identity does so poorly seem to be
that it ignores information about gaps and about the conservative
or radical nature of residue substitutions.

From the PDB90D-B analysis in Fig. 3, we learn that 30%
identity is a reliable threshold for this database only for
sequence alignments of at least 150 residues. Because one
unrelated pair of proteins has 43.5% identity over 62 residues,
it is probably necessary for alignments to be at least 70 residues 
in length before 40% is a reasonable threshold, for a database 
of this particular size and composition. 

At a given reliability, scores based on percentage identity 
detect just a fraction of the distant homologs found by 
statistical scoring. If one measures the percentage identity in 
the aligned regions without consideration of alignment length, 
then a negligible number of distant homologs are detected. 
Use of the hssp equation improves the value of percentage 
identity, but even this measure can find only 4% of all known
homologs at 1% EPQ. In short, percentage identity discards 
most of the information measured in a sequence comparison. 

Raw Scores. Smith-Waterman raw scores perform better 
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise 
when using either raw or bit scores because a 20% change in 
cutoff score could yield a tenfold difference in EPQ. However, 
it is difficult to choose appropriate thresholds because the
reliability of a bit score depends on the lengths of the proteins 
matched and the size of the database. Raw score thresholds 
also are affected by matrix and gap parameters. 

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most 
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likely, its power can be attributed to its incorporation of more 
information than any other measure; it takes account of the 
full substitution and gap data (like raw scores) but also has 
details about the sequence lengths and composition and is 
scaled appropriately. 

We find that statistical scores are not only powerful, but also
easy to interpret. SSEARCH and FASTA show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two sequences
being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries.
Neither raw scores nor percentage identity can be interpreted
in this way, and these results validate the suitability of the
extreme value distribution for describing the scores from a
database search.
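
The reading of an E-value as an expected count, and its relation to a P-value, can be sketched as follows. This assumes the usual Poisson model for chance hits, under which P = 1 - exp(-E); that relation is standard for extreme value statistics but is not spelled out in the text, and the function names are hypothetical.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance hit, given an expected count E."""
    return -math.expm1(-e_value)  # 1 - exp(-E), accurate for small E


def e_from_p(p_value: float) -> float:
    """Expected number of chance hits corresponding to a P-value (P < 1)."""
    return -math.log1p(-p_value)


def expected_false_hits(e_value: float, n_queries: int) -> float:
    """Chance pairs expected at this E-value over n_queries searches."""
    return e_value * n_queries
```

For small values the two scores are nearly equal (an E-value of 0.01 gives a P-value of about 0.00995), while at larger values the P-value saturates toward 1 as the E-value keeps growing.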

The P-values from BLAST also should be directly interpretable
but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database. Nonetheless,
these results strongly suggest that the analytic theory is
fundamentally appropriate. WU-BLAST2 scores were more reliable
than those from BLAST, but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.

Overall Detection of Homologs and Comparison of Algorithms.
The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in PDB40D-B.
Even SSEARCH with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas FASTA
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect homologs.
Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores.
In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is WU-BLAST2. Consequently, we infer that the
differences between FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.

regions show that SSEARCH can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

Fig. 6 helps to explain why most distant homologs cannot be
found by sequence comparison: a great many such relationships
have no more sequence identity than would be expected
by chance. SSEARCH with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have significant
E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by SSEARCH E-values. However, although the number
of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity
are detected and only 10% of those with 15-20% can be found.
These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power
of the method is restricted by the great divergence of many
protein sequences.
After completion of this work, a new version of pairwise
BLAST was released: BLASTGP (37). It supports gapped alignments,
like WU-BLAST2, and dispenses with sum statistics. Our
initial tests on BLASTGP using default parameters show that its
E-values are reliable and that its overall detection of homologs
was substantially better than that of ungapped BLAST but not
quite equal to that of WU-BLAST2.

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27
and references therein) suggests that the most effective sequence
searches are made by (i) using a large current database
in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view. 

Our results also suggest two further points. First, the E-values
reported by FASTA and SSEARCH give fairly accurate
estimates of the significance of each match, but the P-values
provided by BLAST and WU-BLAST2 underestimate the true
extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                 Relative   1% EPQ             Coverage at
                                       time*      cutoff             1% EPQ (%)

SSEARCH % identity: within alignment   25.5       >70%               <0.1
SSEARCH % identity: within both        25.5       34%                3.0
SSEARCH % identity: HSSP-scaled        25.5       35% (HSSP + 9.8)   4.0
SSEARCH Smith-Waterman raw scores      25.5       142                10.5
SSEARCH E-values                       25.5       0.03               18.4
FASTA ktup = 1 E-values                3.9        0.03               17.9
FASTA ktup = 2 E-values                1.4        0.03               16.7
WU-BLAST2 P-values                     1.1        0.003              17.5
BLAST P-values                         1.0        0.00016            14.8

*Times are from large database searches with genome proteins.

The homologous proteins that are found by sequence com- 
parison can be distinguished with high reliability from the huge 
number of unrelated pairs. However, even the best database 
searching procedures tested fail to find the large majority of 
distant evolutionary relationships at an acceptable error rate. 
Thus, if the procedures assessed here fail to find a reliable 
match, it does not imply that the sequence is unique; rather, it 
indicates that any relatives it might have are distant ones.*

*Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.
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